The pattern of genetic hitchhiking under recurrent mutation
Abstract

Genetic hitchhiking describes evolution at a neutral locus that is linked to a selected locus. If
a beneficial allele rises to fixation at the selected locus, a characteristic polymorphism pattern
(a so-called selective sweep) emerges at the neutral locus. The classical model assumes that
fixation of the beneficial allele occurs from a single copy of this allele that arises by mutation.
However, recent theory (Pennings and Hermisson, 2006a,b) has shown that recurrent
beneficial mutation at biologically realistic rates can lead to markedly different polymorphism
patterns, so-called soft selective sweeps. We extend an approach that has recently been developed
for the classical hitchhiking model (Schweinsberg and Durrett, 2005; Etheridge et al., 2006)
to study the recurrent mutation scenario. We show that the genealogy at the neutral
locus can be approximated (to leading orders in the selection strength) by a marked Yule process
with immigration. Using this formalism, we derive an improved analytical approximation
for the expected heterozygosity at the neutral locus at the time of fixation of the beneficial
allele.



1 Introduction 



The model of genetic hitchhiking, introduced by Maynard Smith and Haigh (1974), describes the
process of fixation of a new mutation due to its selective advantage. During this fixation process,
linked neutral DNA variants that are initially associated with the selected allele will hitchhike
and also increase in frequency. As a consequence, sequence diversity in the neighborhood of the
selected locus is much reduced when the beneficial allele fixes, a phenomenon known as a selective
sweep. This characteristic pattern in DNA sequence data can be used to detect genes that have
been adaptive targets in the recent evolutionary history by statistical tests (e.g. Kim and Stephan,
2002; Nielsen et al., 2005; Jensen et al., 2007).

Since its introduction, several analytic approximations to quantify the hitchhiking effect have
been developed (Kaplan et al., 1989; Stephan et al., 1992; Barton, 1998; Schweinsberg and Durrett,
2005; Etheridge et al., 2006; Eriksson et al., 2008). The mathematical analysis of selective sweeps
makes use of the coalescent framework (Kingman, 1982; Hudson, 1983), which describes the
genealogy of a population sample backward in time. Most studies follow the suggestion of Kaplan et al.
(1989) and use a structured coalescent to describe the genetic footprint at a linked neutral locus,
conditioned on an approximated frequency path of the selected allele. In this approach, population
structure at the neutral locus consists of the wild-type and beneficial background at the selected
locus, respectively. A mathematically rigorous construction was given by Barton et al. (2004).
Moreover, a structured ancestral recombination graph was used in Pfaffelhuber and Studeny (2007),
McVean (2007) and Pfaffelhuber et al. (2008) to describe the common ancestry of two neutral loci
linked to the beneficial allele.

It has long been noted that the initial rise in frequency of a beneficial allele is similar to
the evolution of the total mass of a supercritical branching process (Fisher, 1930; Kaplan et al.,
1989; Barton, 1998; Ewens, 2004, p. 27f). This insight led to the approximation of the structured
coalescent by the genealogy of a supercritical branching process, a Yule process (O'Connell, 1993;
Evans and O'Connell, 1994). Given a selection intensity of $\alpha$ and a recombination rate of $\rho$ between
the selected and neutral locus, it has been shown that a Yule process with branching rate $\alpha$, which
is marked at rate $\rho$ and stopped upon reaching $\lfloor 2\alpha \rfloor$ lines, is an accurate approximation of the
structured coalescent (Schweinsberg and Durrett, 2005; Etheridge et al., 2006; Pfaffelhuber et al.,
2006). For the standard scenario of genetic hitchhiking, this approach leads to a refined analytical
approximation of the sampling distribution, to estimates of the approximation error and to efficient
numerical simulations.

The classical hitchhiking model assumes that adaptation occurs from a single origin of the
beneficial allele. An explicit mutational process at the selected locus, where the beneficial allele
can enter the population recurrently, is not taken into account. However, it has recently been
demonstrated that recurrent beneficial mutation at a biologically realistic rate can lead to
considerable changes in the selective footprint in DNA sequence data (Hermisson and Pennings, 2005;
Pennings and Hermisson, 2006a,b). In the present paper, we extend the Yule process approach of
Etheridge et al. (2006) to the full biological model with recurrent mutation at the beneficial locus.
Specifically, we show that the genealogy at the selected site can be approximated by a Yule process
with immigration. Our results can serve as a basis for a detailed analysis of patterns of genetic
hitchhiking under recurrent mutation, such as the site-frequency spectrum and linkage
disequilibrium patterns. As an example of such an application, we derive the expected heterozygosity in
Section 3.3.

The paper is organized as follows. In Section 2 we introduce the model as well as the structured
coalescent and we discuss the biological context of our work. In Section 3 we state results on the
adaptive process, give the approximation of the structured coalescent by a Yule process with
immigration and apply the approximation to derive expressions for the heterozygosity at the
neutral locus at the time of fixation. In Sections 4, 5 and 6 we collect all proofs.



2 The model 

We describe evolution in a two-locus system, where a neutral locus is linked to a locus experiencing
positive selection. In Section 2.1 we first focus on the selected locus and formulate the adaptive
process as a diffusion. In Section 2.2 we describe the genealogy at the neutral locus by a structured
coalescent. In Section 2.3 we discuss the biological context.

2.1 Time-forward process 

Consider a population of constant size $N$. Individuals are haploid; their genotype is thus
characterized by a single copy of each allele. Selection acts on a single bi-allelic locus. The ancestral
(wild-type) allele $b$ has fitness 1 and the beneficial variant $B$ has fitness $1 + s$, where $s > 0$ is
the selection coefficient. Mutation from $b$ to $B$ is recurrent and occurs with probability $u$ per
individual per generation. Let $X_t$ be the frequency of the $B$ allele in generation $t$. In a standard
Wright-Fisher model with discrete generations, the number of $B$-alleles in the offspring generation
$t + 1$ is $N X_{t+1}$, which is binomially distributed with parameters

$$\frac{(1+s)\big(X_t + u(1 - X_t)\big)}{(1+s)\big(X_t + u(1 - X_t)\big) + (1-u)(1 - X_t)} \quad\text{and}\quad N.$$
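The update rule of this Wright-Fisher model can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation code behind the paper: it assumes that mutation from $b$ to $B$ acts before selection within a generation, and the function names `next_gen_prob` and `simulate_fixation` as well as the parameter values are hypothetical.

```python
import random

def next_gen_prob(x, s, u):
    """Success probability of the binomial draw for the next generation.

    Assumption: mutation b -> B acts first, selection second.
    """
    b_weight = (1.0 + s) * (x + u * (1.0 - x))  # B alleles, weighted by fitness 1 + s
    w_weight = (1.0 - u) * (1.0 - x)            # remaining b alleles, fitness 1
    return b_weight / (b_weight + w_weight)

def simulate_fixation(N, s, u, rng, max_gen=10**6):
    """Run the chain from X_0 = 0 until B is fixed.

    Returns the generation of fixation, or None if max_gen is reached.
    """
    count = 0
    for gen in range(1, max_gen + 1):
        p = next_gen_prob(count / N, s, u)
        # Binomial(N, p) drawn as a sum of N Bernoulli trials (fine for small N).
        count = sum(1 for _ in range(N) if rng.random() < p)
        if count == N:
            return gen
    return None

rng = random.Random(1)
T = simulate_fixation(N=100, s=0.1, u=0.005, rng=rng)
print("fixation after", T, "generations")
```

Since there are no back-mutations, the state $X_t = 1$ is absorbing in this sketch: `next_gen_prob(1.0, s, u)` equals 1 for any `s` and `u`.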

We assume that the beneficial allele $B$ is initially absent from the population in generation
$t = 0$ when the selection pressure on the $B$ locus sets in. Since the $B$ allele is created recurrently by
mutation and we ignore back-mutations, it will eventually fix at some time $T$, i.e. $X_t = 1$ for $t \geq T$.
This process of fixation can be approximated by a diffusion. To this end, let $X^N = (X^N_t)_{t=0,1,2,\ldots}$
with $X^N_0 = 0$ be the path of allele frequencies of $B$.
Assuming $u = u_N \to 0$, $s = s_N \to 0$ such that $2Nu \to \theta$, $Ns \to \alpha$ as $N \to \infty$, it is well-known
(see e.g. Ewens, 2004) that $(X^N_{\lfloor Nt \rfloor})_{t \geq 0} \Rightarrow (X_t)_{t \geq 0}$ as $N \to \infty$, where $X = (X_t)_{t \geq 0}$ follows the
SDE

$$dX = \Big(\frac{\theta}{2}(1 - X) + \alpha X(1 - X)\Big)\,dt + \sqrt{X(1 - X)}\,dW \qquad (2.1)$$

with $X_0 = 0$. In other words, the diffusion approximation of $X^N$ is given by a diffusion $X$ with
drift and diffusion coefficients

$$\mu_{\alpha,\theta}(x) = \Big(\frac{\theta}{2} + \alpha x\Big)(1 - x), \qquad \sigma^2(x) = x(1 - x).$$

We denote by $\mathbb{P}^p_{\alpha,\theta}[\cdot]$ and $\mathbb{E}^p_{\alpha,\theta}[\cdot]$ the probability distribution and its expectation with respect to
the diffusion with parameters $\mu_{\alpha,\theta}$ and $\sigma^2$ and $X_0 = p$ almost surely. The fixation time can be
expressed in the diffusion setting as

$$T := \inf\{t \geq 0 : X_t = 1\}. \qquad (2.2)$$
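To get a feeling for the diffusion (2.1) and the fixation time (2.2), one can discretize the SDE. The following is a crude Euler-Maruyama sketch, assuming a fixed step size `dt` and simple clipping of the path to $[0, 1]$; the function names and all numerical values are illustrative choices, and the scheme is not meant as an accurate solver near the boundaries.

```python
import math
import random

def drift(x, alpha, theta):
    """Drift coefficient mu_{alpha,theta}(x) of the diffusion (2.1)."""
    return (theta / 2.0 + alpha * x) * (1.0 - x)

def simulate_T(alpha, theta, rng, dt=1e-4, t_max=50.0):
    """Approximate the fixation time T by an Euler-Maruyama path started in 0.

    Returns the first grid time at which the clipped path reaches 1,
    or t_max if that does not happen before t_max.
    """
    x, t = 0.0, 0.0
    while t < t_max:
        noise = math.sqrt(x * (1.0 - x) * dt) * rng.gauss(0.0, 1.0)
        x = min(max(x + drift(x, alpha, theta) * dt + noise, 0.0), 1.0)
        t += dt
        if x >= 1.0:
            return t
    return t_max

rng = random.Random(7)
T = simulate_T(alpha=20.0, theta=1.0, rng=rng)
print("approximate fixation time:", T)
```

At $x = 0$ the noise term vanishes and the drift equals $\theta/2$, so the recurrent mutation is what pushes the path away from the boundary.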



2.2 Genealogies 



We are interested in the change of polymorphism patterns at a neutral locus that is linked to
a selected locus. We ignore recombination within the selected and the neutral locus, but (with
sexual reproduction) there is the chance of recombination between the selected and the neutral
locus. Let the recombination rate per individual be $\rho$ in the diffusion scaling (i.e. $r = r_N$ is the
recombination probability in a Wright-Fisher model of size $N$, with $r_N \to 0$ and $N r_N \to \rho$ as $N \to \infty$).
Not all recombination events have the same effect, however. We will be particularly interested
in events that change the genetic background of the neutral locus at the selected site from $B$
to $b$, or vice versa. This is only possible if $B$ individuals from the parent generation reproduce
with $b$ individuals. Under the assumption of random mating, the effective recombination rate in
generation $t$ that changes the genetic background is thus $\rho X_t(1 - X_t)$ in the diffusion setting.



Following Barton et al. (2004), we use the structured coalescent to describe the polymorphism
pattern at the neutral locus in a sample. In this framework, the population is partitioned into
two demes according to the allele ($B$ or $b$) at the selected locus. The relative size of these demes
is defined by the fixation path $X$ of the $B$ allele. Only lineages in the same deme can coalesce.
Transition among demes is possible by either recombination or mutation at the selected locus. We
focus on the pattern at the time $T$ of fixation of the beneficial allele. Throughout we fix a sample
size $n$.

Remark 2.1. We define the coalescent as a process that takes values in partitions and introduce
the following notation. Denote by $\Sigma_n$ the set of partitions of $\{1, \ldots, n\}$. Each $\xi \in \Sigma_n$ is thus a set
$\xi = \{\xi_1, \ldots, \xi_{|\xi|}\}$ such that $\bigcup_{i=1}^{|\xi|} \xi_i = \{1, \ldots, n\}$ and $\xi_i \cap \xi_j = \emptyset$ for $i \neq j$. Partitions can also be
defined by equivalence relations and we write $k \sim_\xi \ell$ iff there is $1 \leq i \leq |\xi|$ such that $k, \ell \in \xi_i$.
Equivalently, $\xi$ defines a map $\xi : \{1, \ldots, n\} \to \{1, \ldots, |\xi|\}$ by setting $\xi(k) = i$ iff $k \in \xi_i$. We will
also need the notion of a composition of two partitions. If $\xi$ is a partition of $\{1, \ldots, n\}$ and $\eta$ is a
partition of $\{1, \ldots, |\xi|\}$, define the partition $\xi \circ \eta$ on $\{1, \ldots, n\}$ by $k \sim_{\xi \circ \eta} \ell$ iff $\xi(k) \sim_\eta \xi(\ell)$.
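The composition of partitions from Remark 2.1 can be made concrete with a small sketch. It assumes that a partition is stored as a list of `frozenset` blocks and that the block numbering $\xi_1, \xi_2, \ldots$ is simply the list order; both the representation and the names `block_index` and `compose` are illustrative choices.

```python
def block_index(partition, k):
    """Return i such that k lies in the i-th block (1-based), i.e. xi(k) = i."""
    for i, block in enumerate(partition, start=1):
        if k in block:
            return i
    raise ValueError(f"{k} is not covered by the partition")

def compose(xi, eta):
    """Composition xi o eta: k ~ l iff xi(k) and xi(l) lie in one block of eta."""
    merged = []
    for eta_block in eta:
        union = set()
        for i in sorted(eta_block):
            union |= xi[i - 1]  # eta acts on the block numbers of xi
        merged.append(frozenset(union))
    return merged

xi = [frozenset({1}), frozenset({2, 3}), frozenset({4})]  # partition of {1, 2, 3, 4}
eta = [frozenset({1, 2}), frozenset({3})]                 # partition of {1, 2, 3}
print(compose(xi, eta))  # blocks {1, 2, 3} and {4}
```

Here $\eta$ merges the first two blocks of $\xi$, so that $1 \sim_{\xi \circ \eta} 3$ although $1 \not\sim_\xi 3$.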

Setting $\beta = T - t$ we are interested in the genealogical process $\xi^X = (\xi_\beta)_{0 \leq \beta \leq T}$ of a sample of
size $n$, conditioned on the path $X$ of the beneficial allele $B$. The state space of $\xi^X$ is

$$\mathbb{S}_n := \{(\xi^B, \xi^b) : \xi^B \cup \xi^b \in \Sigma_n\}.$$

Elements of $\xi^B$ ($\xi^b$) are ancestral lines of neutral loci that are linked to a beneficial (wild-type)
allele. Since there are only beneficial alleles at time $T$, the starting configuration of $\xi^X$ is

$$\xi_0 = (\{\{1\}, \ldots, \{n\}\}, \emptyset).$$

For a given coalescent state $\xi_\beta = (\xi^B, \xi^b)$ at time $\beta$, several events can occur, with rates that
depend on the value of the frequency path $X$ at that time, $X_{T-\beta}$. Coalescences of pairs of lines
in the beneficial (wild-type) background occur at rate $1/X_{T-\beta}$ ($1/(1 - X_{T-\beta})$). Formally, for all
pairs $1 \leq i < j \leq |\xi^B|$ and $1 \leq i' < j' \leq |\xi^b|$, transitions occur at time $\beta$ to

$$\big((\xi^B \setminus \{\xi^B_i, \xi^B_j\}) \cup \{\xi^B_i \cup \xi^B_j\},\ \xi^b\big) \quad\text{with rate}\quad \frac{1}{X_{T-\beta}},$$
$$\big(\xi^B,\ (\xi^b \setminus \{\xi^b_{i'}, \xi^b_{j'}\}) \cup \{\xi^b_{i'} \cup \xi^b_{j'}\}\big) \quad\text{with rate}\quad \frac{1}{1 - X_{T-\beta}}. \qquad (2.3)$$

    event               rate
    coal in B           $1/X_t$
    coal in b           $1/(1 - X_t)$
    mut from B to b     $\frac{\theta}{2}\,\frac{1 - X_t}{X_t}$
    rec from B to b     $\rho(1 - X_t)$
    rec from b to B     $\rho X_t$

Table 1: Transition rates in the process $\xi^X$ at time $t = T - \beta$. Coalescence rates are equal for
all pairs of partition elements in the beneficial and wild-type background. Recombination and
mutation rates are equal for all partition elements in $\xi^B$ and $\xi^b$.



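Changes of the genetic background happen either due to mutation at the selected locus or
recombination events between the selected and the neutral locus. For $1 \leq i \leq |\xi^B|$, transitions of genetic
backgrounds due to mutation occur at time $\beta$ from $\xi_\beta = (\xi^B, \xi^b)$ to

$$\big(\xi^B \setminus \{\xi^B_i\},\ \xi^b \cup \{\xi^B_i\}\big) \quad\text{with rate}\quad \frac{\theta}{2}\,\frac{1 - X_{T-\beta}}{X_{T-\beta}}. \qquad (2.4)$$

(Recall that we assume that there are no back-mutations to the wild-type.) Moreover, changes of
the genetic background due to recombination occur at time $\beta$ for $1 \leq i \leq |\xi^B|$, $1 \leq i' \leq |\xi^b|$ from
$\xi_\beta = (\xi^B, \xi^b)$ to

$$\big(\xi^B \setminus \{\xi^B_i\},\ \xi^b \cup \{\xi^B_i\}\big) \quad\text{with rate}\quad \rho(1 - X_{T-\beta}), \qquad (2.5a)$$
$$\big(\xi^B \cup \{\xi^b_{i'}\},\ \xi^b \setminus \{\xi^b_{i'}\}\big) \quad\text{with rate}\quad \rho X_{T-\beta}. \qquad (2.5b)$$

All rates of $\xi^X$ are collected in Table 1.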
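The transition rates (2.3)-(2.5) are per pair or per line; the total rate of each event type in a given state follows by counting pairs and lines. A minimal sketch, assuming that only the numbers of lines in the two backgrounds matter for the totals (the function name `total_rates` and the dictionary keys are illustrative):

```python
def total_rates(n_B, n_b, x, theta, rho):
    """Total rates of the five event types of Table 1.

    n_B, n_b : number of lines in the beneficial / wild-type background
    x        : current frequency X_{T - beta} of the beneficial allele
    """
    if n_B > 0 and x <= 0.0:
        raise ValueError("lines in background B require x > 0")
    if n_b > 1 and x >= 1.0:
        raise ValueError("coalescence in background b requires x < 1")
    pairs_B = n_B * (n_B - 1) / 2
    pairs_b = n_b * (n_b - 1) / 2
    return {
        "coal in B": pairs_B / x if pairs_B else 0.0,
        "coal in b": pairs_b / (1.0 - x) if pairs_b else 0.0,
        "mut from B to b": n_B * theta / 2.0 * (1.0 - x) / x if n_B else 0.0,
        "rec from B to b": n_B * rho * (1.0 - x),
        "rec from b to B": n_b * rho * x,
    }

print(total_rates(n_B=2, n_b=0, x=0.5, theta=1.0, rho=2.0))
```

At the time of fixation ($x = 1$) only coalescence within the beneficial background has a positive rate, which matches the starting configuration of the structured coalescent.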
Remark 2.2. 

1. The rates for mutation and recombination can be understood heuristically. Assume $X^N_t = x$
and assume $u$, $s$, $r$ are small. A neutral locus linked to a beneficial allele in generation $t + 1$
falls into one of three classes: (i) the class for which the ancestor of the selected allele was
beneficial has frequency $x + \mathcal{O}(u, s, r)$; (ii) the class for which the beneficial allele was a
wild-type and mutated in the last generation has frequency $u(1 - x) + \mathcal{O}(us, ur)$; (iii) the class
for which the neutral locus was linked to a wild-type allele in generation $t$ and recombined
with a beneficial allele has frequency $rx(1 - x) + \mathcal{O}(ru, rs)$. Hence, if we are given a neutral
locus in the beneficial background, the probability that its linked selected locus experienced
a mutation one generation ago is $\frac{u(1-x)}{x} + \mathcal{O}(u^2, us, ur)$ and that it recombined with a
wild-type allele one generation ago is $r(1 - x) + \mathcal{O}(ru, rs, r^2)$. Thus, the rates (2.4) and (2.5a)
arise by a rescaling of time by $N$.

2. In (2.3) and (2.4) the rates have singularities when $X_{T-\beta} = 0$. However, we will show in
Lemma 5.3, using arguments from Barton et al. (2004) and Taylor (2007), that a line will
almost surely leave the beneficial background before such a singularity occurs. In particular,
the structured coalescent process $\xi^X$ is well-defined.

2.3 Biological context



A selective sweep refers to the reduction of sequence diversity and a characteristic polymorphism
pattern around a positively selected allele. Models show that this pattern is most pronounced close
to the selected locus if selection is strong and if the sample is taken in a short time window after the
fixation of the beneficial allele (i.e. before it is diluted by new mutations). Today, biologists try to
detect sweep patterns in genome-wide polymorphism scans in order to identify recent adaptation
events (e.g. Harr et al., 2002; Ometto et al., 2005; Williamson et al., 2005).

The detection of sweep regions is complicated by the fact that certain demographic events in
the history of the population (in particular bottlenecks) can lead to very similar patterns. Conversely,
the footprint of selection can take various guises. In particular, recent theory shows
that the pattern can change significantly if the beneficial allele at the time of fixation traces back
to more than a single origin at the start of the selective phase (i.e. there is more than a single
ancestor at this time). As a consequence, genetic variation that is linked to any of the successful
origins of the beneficial allele will survive the selective phase in proximity of the selective target
and the reduction in diversity (measured e.g. by the number of segregating sites or the average
heterozygosity in a sample) is less severe. Pennings and Hermisson (2006a) therefore called the
resulting pattern a soft selective sweep, in distinction to the classical hard sweep from only a single
origin. Nevertheless, a soft sweep also has highly characteristic features, such as a more pronounced
pattern of linkage disequilibrium as compared to a hard sweep (Pennings and Hermisson, 2006b).

Soft sweeps can arise in several biological scenarios. For example, multiple copies of the
beneficial allele can already segregate in the population at the start of the selective phase (adaptation
from standing genetic variation; Hermisson and Pennings, 2005; Przeworski et al., 2005). Most
naturally, however, the mutational process at the selected locus itself may lead to a recurrent
introduction of the beneficial allele. Any model, like the one in this article, that includes an explicit
treatment of the mutational process will therefore necessarily also allow for soft selective sweeps.
For biological applications the most important question then is: When are soft sweeps from
recurrent mutation likely? The results of Pennings and Hermisson (2006a) as well as Theorem 1 in
the present paper show that the probability of soft selective sweeps is mainly dependent on the
population-wide mutation rate $\theta$. The classical results of a hard sweep are reproduced in the limit
$\theta \to 0$ and generally hold as a good approximation for $\theta < 0.01$ in samples of moderate size. For
larger $\theta$, approaching unity, soft sweep phenomena become important.

Since $\theta$ scales like the product of the (effective) population size and the mutation rate per allele,
soft sweeps become likely if either of these factors is large. Very large population sizes are primarily
found for insects and microbial organisms. Consequently, soft sweep patterns have been found,
e.g., in Drosophila (Schlenke and Begun, 2004) and in the malaria parasite Plasmodium falciparum
(Nair et al., 2007). Since point mutation rates (mutation rates per DNA base per generation per
individual) are typically very small ($\sim 10^{-8}$), large mutation rates are usually found in situations
where many possible mutations produce the same (i.e. physiologically equivalent) allele. This
holds, in particular, for adaptive loss-of-function mutations, where many mutations can destroy
the function of a gene. An example is the loss of pigmentation in Drosophila santomea (Jeong et al.,
2008). But also adaptations in regulatory regions often have large mutation rates and can occur
recurrently. A well-known example is the evolution of adult lactose tolerance in humans, where
several mutational origins have been identified (Tishkoff et al., 2007).



Several extensions of the model introduced in Section 2 are possible. In a full model, we should
allow for the possibility of back-mutations from the beneficial to the wild-type allele in natural
populations. However, such events are rarely seen in any sample because such back-mutants have
lower fitness and are therefore less likely to contribute any offspring to the population at the time
of fixation. Another step towards a more realistic modeling of genetic hitchhiking under recurrent
mutation would be to allow for beneficial mutation to the same (physiological) allele at multiple
different positions of the genome. In such a model, recombination between the different positions
of the beneficial mutation in the genome would complicate our analysis.

3 Results

The process of fixation of the beneficial allele is described by the diffusion (2.1). In Section 3.1
we will derive approximations for the fixation time $T$ of this process. These results will be needed
in Section 3.2 where we construct an approximation for the structured coalescent $\xi^X$.



3.1 Fixation times 

In the study of the diffusion (2.1) the time $T$ of fixation of the beneficial allele (see (2.2)) is of
particular interest. We decompose the interval $[0; T]$ by the last time a frequency of $X_t = 0$ was
reached, i.e., we define

$$T_0 := \sup\{t \geq 0 : X_t = 0\}, \qquad T^* := T - T_0.$$

Note that for $\theta \geq 1$, the boundary $x = 0$ is inaccessible, such that $T_0 = 0$, $T^* = T$, almost surely,
in this case.

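Proposition 3.1.

1. Let $\gamma_e \approx 0.57$ be Euler's $\gamma$. For $\theta > 0$,

$$\mathbb{E}^0_{\alpha,\theta}[T] = \frac{1}{\alpha}\Big(2\log(2\alpha) + 2\gamma_e + \frac{1}{\theta} - \theta \sum_{n=1}^{\infty} \frac{1}{n(n+\theta)}\Big) + \mathcal{O}\Big(\frac{1}{\alpha^2}\Big) + \frac{1}{\theta}\,\mathcal{O}\big(\alpha e^{-\alpha}\big). \qquad (3.1)$$

2. For $\theta \geq 1$, almost surely, $T = T^*$.

3. For $0 \leq \theta \leq 1$,

$$\mathbb{E}^0_{\alpha,\theta}[T^*] = \frac{2}{\alpha}\big(\log(2\alpha) + \gamma_e\big) + \mathcal{O}\Big(\frac{1}{\alpha^2}\Big). \qquad (3.2)$$

4. For $\theta \geq 0$,

$$\mathbb{V}^0_{\alpha,\theta}[T^*] = \mathcal{O}\Big(\frac{1}{\alpha^2}\Big). \qquad (3.3)$$

All error terms are in the limit for large $\alpha$ and are uniform on compacta in $\theta$.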
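The leading terms of (3.1) and (3.2) are easy to evaluate numerically. The sketch below drops all error terms and truncates the infinite series after a finite number of terms, adding an integral estimate for the remainder; the function names, the truncation point and the tail correction are illustrative choices and not part of the proposition.

```python
import math

EULER_GAMMA = 0.5772156649015329

def series(theta, terms=100_000):
    """sum_{n >= 1} 1 / (n (n + theta)) for theta > 0.

    Truncated sum plus a midpoint-rule integral estimate of the tail.
    """
    head = sum(1.0 / (n * (n + theta)) for n in range(1, terms + 1))
    tail = math.log1p(theta / (terms + 0.5)) / theta
    return head + tail

def mean_T_star(alpha):
    """Leading terms of (3.2): expected length of the sweep phase T*."""
    return 2.0 * (math.log(2.0 * alpha) + EULER_GAMMA) / alpha

def mean_T(alpha, theta):
    """Leading terms of (3.1): expected total fixation time T, theta > 0."""
    bracket = (2.0 * math.log(2.0 * alpha) + 2.0 * EULER_GAMMA
               + 1.0 / theta - theta * series(theta))
    return bracket / alpha

alpha = 1000.0
for theta in (0.01, 0.1, 1.0):
    print(theta, mean_T(alpha, theta), 1.0 / (alpha * theta) + mean_T_star(alpha))
```

For $\theta = 1$ the series sums to 1, so the bracket in (3.1) collapses to that of (3.2). For small $\theta$ the printed values are close to $1/(\alpha\theta)$ plus the leading terms of (3.2), i.e. a waiting time for establishment followed by the sweep.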
Remark 3.2. 

1. Note that (3.1) reduces to (3.2) for $\theta = 1$, as it should since $T_0 = 0$ in this case.

2. For $\theta \leq 1$, we find that $\mathbb{E}^0_{\alpha,\theta}[T^*]$ is independent of $\theta$ to the order considered. In particular
it is identical to the conditioned fixation time without recurrent mutation ($\theta = 0$) that
was previously derived (van Herwaarden and van der Wal, 2002; Hermisson and Pennings,
2005; Etheridge et al., 2006). A detailed numerical analysis (not shown) demonstrates that
the passage times of the beneficial allele decrease at intermediate and high frequencies, but
increase at low frequencies $X < 1/\alpha$, where recurrent mutation prevents the allele from dying
out. Both effects do not affect the leading order and precisely cancel in the second order for
large $\alpha$.

3. To leading order in $1/\theta$ and $\alpha$, the total fixation time (3.1) is

$$\mathbb{E}^0_{\alpha,\theta}[T] \approx \frac{1}{\alpha\theta} + \mathbb{E}^0_{\alpha,\theta}[T^*].$$

Since the fixation probability of a new beneficial mutation is $P_{\text{fix}} \approx 2s$ and the rate of new
beneficial mutations per time unit (of $N$ generations) is $N\theta/2$, mutations that are destined for
fixation enter the population at rate $sN\theta \approx \alpha\theta$. The total fixation time thus approximately
decomposes into the conditioned fixation time $\mathbb{E}[T^*]$ and the exponential waiting time for
the establishment of the beneficial allele, $\frac{1}{\alpha\theta}$.

4. In applications, selective sweeps are found with $\alpha \geq 100$. We can then ignore the error term
$\frac{1}{\theta}\mathcal{O}(\alpha e^{-\alpha})$ in (3.1) even for extremely rare mutations, i.e. for very small $\theta$.

5. The proof of Proposition 3.1 can be found in Section 4.

[Figure 1 (image not reproduced); labels in the figure: "2α lines", "recombination".]

Figure 1: The Yule process approximation for the genealogy at the neutral locus in a sample of
size $n = 6$. The Yule process with immigration produces a random forest (grey lines) which grows
from the past (bottom) to the present (top). A sample is drawn in the present. Every line is marked
at constant rate along the Yule forest, indicating recombination events. Sample individuals within
the same tree not separated by a recombination mark share ancestry and thus belong to the same
partition element of $\Upsilon$. In this realization, we find $\Upsilon = \{\{1\}, \{2, 3\}, \{4\}, \{5, 6\}\}$.

3.2 The Yule approximation 

We will provide a useful approximation of the coalescent process $\xi^X$ with rates defined in (2.3)-(2.5).
As already seen in the last section, the process of fixation of the beneficial allele can be decomposed
into two parts. First, the beneficial allele has to be established, i.e., its frequency must not hit 0
any more. Second, the established allele must fix in the population. The first phase has an
expected length of about $1/(\alpha\theta)$ and hence may be long even for large values of $\alpha$, depending
on $\theta$. The second phase has an expected length of order $(\log\alpha)/\alpha$ and is thus short for large
$\alpha$, independently of $\theta$. For the potentially long first phase we give an approximation for the
distribution of the coalescent on path space by a finite Kingman coalescent. For the short second
phase, we obtain an approximation of the distribution of the coalescent (which is started at time
$T$) at time $T_0$ using a Yule process with immigration (which constructs a genealogy forward in
time). To formulate our results, define

$$\beta_0 := T - T_0.$$

Setting $X_t = 0$ for $t < 0$, we will obtain approximations for the distribution of coalescent states at
time $\beta_0$,

$$\xi_{\beta_0} := (\xi^B_{\beta_0}, \xi^b_{\beta_0}),$$

and of the genealogies for $\beta \geq \beta_0$, i.e. in the phase prior to establishment of the beneficial allele,

$$\xi_{\geq\beta_0} := (\xi^B_{\geq\beta_0}, \xi^b_{\geq\beta_0}) := (\xi^B_{\beta_0 + t}, \xi^b_{\beta_0 + t})_{t \geq 0}.$$

Note that $\xi_{\beta_0} \in \mathbb{S}_n$ while $\xi_{\geq\beta_0} \in \mathcal{D}([0; \infty), \mathbb{S}_n)$, the space of cadlag paths on $[0; \infty)$ with values in
$\mathbb{S}_n$.



Let us start with $\xi_{\beta_0}$ (see Figure 1 for an illustration of our approximation). Consider the
selected site first. Take a Yule process with immigration. Starting with a single line,
• every line splits at rate $\alpha$.

• new lines (mutants) immigrate at rate $\alpha\theta$.

For this process we speak of Yule-time $i$ for the time the Yule process has $i$ lines for the first time.
We stop this Yule process with immigration at Yule-time $\lfloor 2\alpha \rfloor$. In order to define identity by
descent within a sample of $n$ lines, take a sample of $n$ randomly picked lines from the $\lfloor 2\alpha \rfloor$. Note
that the Yule process with immigration defines a random forest $\mathcal{F}$ and we may define the random
partition $\widetilde{\Upsilon}$ of $\{1, \ldots, n\}$ by saying that

$$k \sim_{\widetilde{\Upsilon}} \ell \iff k, \ell \text{ are in the same tree of } \mathcal{F}.$$

As a special case of Theorem 1 we will show that $\widetilde{\Upsilon}$ is a good approximation to $\xi_{\beta_0}$ in the case
$\rho = 0$.
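The forest construction can be simulated directly. The sketch below only tracks the jump chain of the Yule process with immigration, which is all that is needed for tree membership: with $i$ lines present, the next event is a split of a uniformly chosen line with probability $i/(i + \theta)$ and an immigrant founding a new tree otherwise. The function name `yule_forest_partition` and the choice to order blocks by their smallest element are illustrative.

```python
import random

def yule_forest_partition(alpha, theta, n, rng):
    """Partition of a sample {1, ..., n} by tree membership in the Yule forest.

    The process starts with one line and is stopped at floor(2 * alpha) lines;
    n must not exceed floor(2 * alpha).
    """
    total = int(2 * alpha)
    family = [0]        # family[k] = tree label of line k; the first line founds tree 0
    next_family = 1
    while len(family) < total:
        i = len(family)
        if rng.random() < i / (i + theta):
            # Split: the new line joins the tree of a uniformly chosen parent line.
            family.append(family[rng.randrange(i)])
        else:
            # Immigration: a new mutant founds its own tree.
            family.append(next_family)
            next_family += 1
    sample = rng.sample(range(total), n)
    blocks = {}
    for label, line in enumerate(sample, start=1):
        blocks.setdefault(family[line], set()).add(label)
    return sorted((frozenset(b) for b in blocks.values()), key=min)

rng = random.Random(3)
print(yule_forest_partition(alpha=1000, theta=0.5, n=6, rng=rng))
```

For $\theta = 0$ no immigration occurs, all lines belong to the tree of the initial line, and the partition consists of a single block, which corresponds to a hard sweep.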

In order to extend the picture to the general case with recombination, consider a single line
of the neutral allele at time $T$. The line may recombine in the interval $[T_0, T]$ and thus have an
ancestor at time $T_0$ which carries the wild-type allele. Since recombination events take place with
a rate proportional to $\rho$ and $T - T_0 = T^*$ is of the order $(\log\alpha)/\alpha$, it is natural to use the scaling

$$\rho = \gamma\,\frac{\alpha}{\log\alpha}. \qquad (3.4)$$

Take a sample of $n$ lines from the $\lfloor 2\alpha \rfloor$ lines at the top of the Yule tree and consider the subtree of
the $n$ lines. Indicating recombination events, we mark all branches in the subtree independently.
A branch in the subtree which starts at Yule-time $i_1$ and ends at Yule-time $i_2$ is marked with
probability $1 - p_{i_1}^{i_2}(\gamma, \theta)$, where

$$p_{i_1}^{i_2}(\gamma, \theta) := \exp\Big(-\frac{\gamma}{\log\alpha} \sum_{i=i_1+1}^{i_2} \frac{1}{i + \theta}\Big). \qquad (3.5)$$

Then, define the random partition $\Upsilon$ of $\{1, \ldots, n\}$ (our approximation of $\xi_{\beta_0}$) by

$$k \sim_{\Upsilon} \ell \iff k \sim_{\widetilde{\Upsilon}} \ell \ \wedge\ \text{the path from } k \text{ to } \ell \text{ in } \mathcal{F} \text{ is not separated by a mark}.$$
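The marking probability (3.5) can be compared with a simple continuous-time heuristic. The sketch assumes that the sum in (3.5) runs over $i = i_1 + 1, \ldots, i_2$, and it uses the following reasoning, which is an illustration and not the argument of the proofs: while the forest has $i$ lines, the time to the next event is exponential with rate $\alpha(i + \theta)$, so a line marked at rate $\rho$ stays unmarked during that period with probability $\alpha(i+\theta)/(\alpha(i+\theta) + \rho)$. All function names are hypothetical.

```python
import math

def gamma_from_rho(rho, alpha):
    """Invert the scaling (3.4): gamma = rho * log(alpha) / alpha."""
    return rho * math.log(alpha) / alpha

def p_unmarked(i1, i2, gamma, theta, alpha):
    """Formula (3.5), assuming the sum runs over i = i1 + 1, ..., i2."""
    s = sum(1.0 / (i + theta) for i in range(i1 + 1, i2 + 1))
    return math.exp(-gamma / math.log(alpha) * s)

def p_unmarked_product(i1, i2, rho, theta, alpha):
    """Heuristic counterpart: product over the periods with i = i1 + 1, ..., i2 lines."""
    prod = 1.0
    for i in range(i1 + 1, i2 + 1):
        rate = alpha * (i + theta)
        prod *= rate / (rate + rho)
    return prod

alpha, theta, rho = 1000.0, 0.5, 100.0
gamma = gamma_from_rho(rho, alpha)  # roughly 0.69 for rho / alpha = 0.1
top = int(2 * alpha)
print(gamma, p_unmarked(1, top, gamma, theta, alpha),
      p_unmarked_product(1, top, rho, theta, alpha))
```

The two values differ only in second order of $\gamma/\log\alpha$, because $1/(1 + y) = e^{-y} + \mathcal{O}(y^2)$ for small $y$.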



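To obtain an approximation of $\xi_{\geq\beta_0}$ consider the finite Kingman coalescent $C := (C_t)_{t \geq 0}$. Given
there are $m$ lines such that $C_t = C = \{C_1, \ldots, C_m\}$, transitions occur for $1 \leq i < j \leq m$ to

$$(C \setminus \{C_i, C_j\}) \cup \{C_i \cup C_j\} \quad\text{with rate } 1.$$

Given $\Upsilon$, our approximation of $\xi_{\geq\beta_0}$ is

$$\Upsilon \circ C := (\Upsilon \circ C_t)_{t \geq 0}.$$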

Remark 3.3. Our approximations are formulated in terms of the total variation distance of
probability measures. Given two probability measures $\mathbb{P}$, $\mathbb{Q}$ on a $\sigma$-algebra $\mathcal{A}$, the total variation
distance is given by

$$d_{TV}(\mathbb{P}, \mathbb{Q}) = \frac{1}{2} \sup_{A \in \mathcal{A}} \big|\mathbb{P}[A] - \mathbb{Q}[A]\big|.$$

Similarly, for two random variables $X$, $Y$ on $\Omega$ with $\sigma(X) = \sigma(Y)$ and distributions $\mathcal{L}(X)$ and
$\mathcal{L}(Y)$ we will write

$$d_{TV}(X, Y) = d_{TV}(\mathcal{L}(X), \mathcal{L}(Y)).$$

Theorem 1.

1. The distribution of coalescent states at time $\beta_0$ under the full model can be approximated
by a distribution of coalescent states of a Yule process with immigration. In particular,

$$\mathbb{P}_{\alpha,\theta}[\xi^B_{\beta_0} = \emptyset] = 1 \qquad (3.6)$$

and the bound

$$d_{TV}(\xi^b_{\beta_0}, \Upsilon) = \mathcal{O}\Big(\frac{1}{(\log\alpha)^2}\Big) \qquad (3.7)$$

holds in the limit of large $\alpha$ and is uniform on compacta in $n$, $\gamma$ and $\theta$.

2. The distribution of genealogies $\xi_{\geq\beta_0}$ prior to the establishment of the beneficial allele can be
approximated by the distribution of genealogies under a composition of a Yule process with
immigration and the Kingman coalescent. In particular,

$$\ldots\ \frac{1}{\alpha \log\alpha}$$

and the bound

$$d_{TV}(\xi^b_{\geq\beta_0}, \Upsilon \circ C) = \mathcal{O}\Big(\frac{1}{(\log\alpha)^2}\Big) \qquad (3.8)$$

holds in the limit of large $\alpha$ and is uniform on compacta in $n$, $\gamma$ and $\theta$.
Remark 3.4. 

1. Let us give an intuitive explanation for the approximation of the genealogy at the selected
site by the forest partition $\widetilde{\Upsilon}$. Consider a finite population of size $N$. It is well-known that a
supercritical branching process is a good approximation for the frequency path $X$ at times $t$ when
$X_t$ is small. In such a process, each individual branches at rate 1. It either splits in two with
probability $\frac{1+s}{2}$ or dies with probability $\frac{1-s}{2}$. In this setting every line has a probability
of $2s + \mathcal{O}(s^2) \approx 2\alpha/N$ to be of infinite descent. In particular, new mutants that have an
infinite line of descent arise approximately at rate $2s \cdot Nu = \alpha\theta/N$. In addition, when there
are $2Ns$ lines of infinite descent there must be approximately $N$ lines in total, which is the
whole population.

2. Using the approximation of $\xi_{\beta_0}$ by $\Upsilon$ we can immediately derive a result found in Pennings
and Hermisson (2006b): when the Yule process has $i$ lines, the probability that the next event
(either a split of a Yule line or an incoming mutant) is a split is $\frac{i}{i+\theta}$ and that it is an
incoming mutant is $\frac{\theta}{i+\theta}$. This implies that the random forest $\mathcal{F}$ is generated by Hoppe's urn.
Recall also the related Chinese restaurant process; see Aldous (1985) and Joyce and Tavaré (1987).
The resulting sizes of all families are given by the Ewens' Sampling Formula for the $\lfloor 2\alpha \rfloor$ lines
when the Yule tree is stopped. Moreover, the Ewens' Sampling Formula is consistent, i.e.,
subsamples of a large sample again follow the formula.

3. When biologists screen the genome of a sample for selective sweeps, they cannot be sure
to have sampled at time $t = T$. Given they have sampled lines linked to the beneficial type
at $t < T$ when the beneficial allele is already at high frequency (e.g. $X_t \geq 1 - \delta/\log\alpha$
for some $\delta > 0$), the approximations of Theorem 1 still apply. The reason is that neither
recombination events changing the genetic background nor coalescences occur in $[t; T]$ in
$\xi^X$ with high probability; see Section 6. If $t > T$, a good approximation to the genealogy is
$\widetilde{C} \circ \Upsilon \circ C$, where $\widetilde{C}$ is a Kingman coalescent run for time $t - T$.

4. The model parameters $n$, $\gamma$ and $\theta$ enter the error terms $\mathcal{O}(\cdot)$ above. The most severe error
in (3.7) arises from ignoring events with two recombination events on a single line. See also
Remark 5.4. Hence, $\gamma$ enters the error term quadratically. Since each line might have a
double-recombination history, the sample size $n$ enters this error term linearly. The
contribution of $\theta$ to the error term cannot be seen directly and is a consequence of the dependence
of the frequency path $X$ on $\theta$.

Note that coalescence events always affect pairs of lines while both recombination and
mutation affect only single lines. As a consequence, $n$ enters quadratically into higher order
error terms. In particular, for practical purposes, the Yule process approximation becomes
worse for big samples.

5. The proof of Theorem 1 can be found in Section 6. Key facts needed in the proof are collected
in Section 5.
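The Ewens' Sampling Formula mentioned in item 2 of the remark is standard: for a sample of size $n$ with $a_j$ families of size $j$, the probability of the configuration is $\frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_j \frac{\theta^{a_j}}{j^{a_j} a_j!}$. A short sketch evaluating it (the function name `ewens_probability` is an illustrative choice, and $\theta > 0$ is assumed):

```python
import math

def ewens_probability(counts, theta):
    """Ewens' Sampling Formula for theta > 0.

    counts[j - 1] is the number a_j of families of size j in the sample.
    """
    n = sum(j * a for j, a in enumerate(counts, start=1))
    rising = 1.0
    for k in range(n):
        rising *= theta + k  # theta (theta + 1) ... (theta + n - 1)
    prob = math.factorial(n) / rising
    for j, a in enumerate(counts, start=1):
        prob *= theta ** a / (j ** a * math.factorial(a))
    return prob

theta = 0.5
# Two sampled lines in one family: probability 1 / (1 + theta).
print(ewens_probability([0, 1], theta), 1.0 / (1.0 + theta))
```

For $n = 2$ the formula gives probability $1/(1 + \theta)$ that both sampled lines descend from the same origin of the beneficial allele.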



3.3 Application: Expected heterozygosity 

The approximation of Theorem 1 using a Yule forest as a genealogy has direct consequences for the
interpretation of population genetic data. While genealogical trees cannot be observed directly,
their impact on measures of DNA sequence diversity in a population sample can be described.
The idea is that mutations along the genealogy of a sample produce polymorphisms that can be
observed. Genealogies in the neighbourhood of a recent adaptation event are shorter, on average,
meaning that sequence diversity is reduced. This reduction is stronger, however, for a 'hard sweep'
(see Section 2.3), where the sample finds a common ancestor during the time of the selective phase
$\mathbb{E}[T^*] \approx 2\log(\alpha)/\alpha \ll 1$, than for a 'soft sweep', where the most recent common ancestor is older.
Using our fine asymptotics for genealogies, we are able to quantify the prediction of sequence
diversity under genetic hitchhiking with recurrent mutation. In this section we will concentrate
on heterozygosity as the simplest measure of sequence diversity.

By definition, heterozygosity is the probability that two randomly picked lines from a
population are different. Writing $H_t$ for the heterozygosity at time $t$ and using (3.6), we obtain

$$H_T = \mathbb{P}_{\alpha,\theta}\big[\xi^b_{\beta_0} = \{\{1\}, \{2\}\} \,\big|\, \xi_0 = (\{\{1\}, \{2\}\}, \emptyset)\big] \cdot H_{T_0}.$$

Assuming that the population was in equilibrium at time 0, we can use Theorem 1, in particular
(3.7), to obtain an approximation for the heterozygosity at time $T$.

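Proposition 3.5. Abbreviating $p_i := p_i^{\lfloor 2\alpha \rfloor}(\gamma, \theta)$ (compare (3.5)), heterozygosity at time $T$ is
approximated by

$$\frac{H_T}{H_{T_0}} = 1 - \frac{p_1^2}{\theta + 1} - \frac{2\gamma}{\log\alpha} \sum_{i=2}^{\lfloor 2\alpha \rfloor} \frac{2i + \theta}{(i + \theta)^2 (i + 1 + \theta)}\, p_i^2 + \mathcal{O}\Big(\frac{1}{(\log\alpha)^2}\Big), \qquad (3.9)$$

where the error is in the limit of large $\alpha$ and is uniform on compacta in $n$, $\gamma$ and $\theta$.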
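The right-hand side of (3.9) can be evaluated directly and compared with a heuristic sum over coalescence times. The sketch below rests on assumptions that are not taken from the proofs: it uses (3.9) with a negative sign in front of the sum and a sum starting at $i = 2$, and the comparison value uses the large-forest probability $\frac{2i}{(i+\theta)(i+1+\theta)(i+2+\theta)}$ that two sampled lines coalesce when the forest shrinks from $i + 1$ to $i$ lines, weighted with $p_i^2$ for both branches staying unmarked. All function names are hypothetical.

```python
import math

def unmarked_probs(alpha, gamma, theta):
    """p_i := p_i^{floor(2 alpha)}(gamma, theta) from (3.5), for i = 0, ..., floor(2 alpha)."""
    top = int(2 * alpha)
    p = [1.0] * (top + 1)
    tail = 0.0
    for i in range(top - 1, -1, -1):
        tail += 1.0 / (i + 1 + theta)  # adds the term with index i + 1
        p[i] = math.exp(-gamma / math.log(alpha) * tail)
    return p

def het_ratio(alpha, gamma, theta):
    """Right-hand side of (3.9) without the O(1 / (log alpha)^2) term."""
    p = unmarked_probs(alpha, gamma, theta)
    top = len(p) - 1
    corr = sum((2 * i + theta) / ((i + theta) ** 2 * (i + 1 + theta)) * p[i] ** 2
               for i in range(2, top + 1))
    return 1.0 - p[1] ** 2 / (theta + 1.0) - 2.0 * gamma / math.log(alpha) * corr

def het_ratio_direct(alpha, gamma, theta):
    """Heuristic check: 1 minus the summed probability of unmarked coalescence."""
    p = unmarked_probs(alpha, gamma, theta)
    top = len(p) - 1
    ibd = sum(2 * i / ((i + theta) * (i + 1 + theta) * (i + 2 + theta)) * p[i] ** 2
              for i in range(1, top))
    return 1.0 - ibd

for theta in (0.1, 0.5, 1.0):
    print(theta, het_ratio(1000.0, 0.5, theta), het_ratio_direct(1000.0, 0.5, theta))
```

Without recombination ($\gamma = 0$) both expressions reduce to $\theta/(1 + \theta)$, up to the finite size of the forest for the direct sum. This is the probability that two lines trace back to different origins of the beneficial allele.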



Remark 3.6. 

1. The formula (3.9) establishes that

$$\frac{H_T}{H_{T_0}} = 1 - \frac{p_1^2}{\theta + 1} + \mathcal{O}\Big(\frac{1}{\log\alpha}\Big). \qquad (3.10)$$

In particular, to a first approximation, two lines taken from the population at time $T$ are
identical by descent if their linked selected locus has the same origin (probability $\frac{1}{1+\theta}$) and
if both lines were not hit by independent recombination events (probability $p_1^2$).

Figure 2: Reduction in heterozygosity at time of fixation of the beneficial allele. The x-axis shows
the recombination distance of the selected from the neutral locus. Solid lines connect results from
the analytical approximation. Dotted lines show simulation results of a structured coalescent in
a Wright-Fisher model with N ~ IC* and a = 1000. Small vertical bars indicate standard errors 
from 10'^ numerical iterations. 



2. We investigated the quality of the approximation (3.9) by numerical simulations. The outcome 
can be seen in Figure 2. As we see, for $\alpha = 1000$, our approximation works well for all 
values of $\theta \le 1$ up to $\rho/\alpha = 0.1$, i.e., $\gamma = 0.7$. 

3. We can compare Proposition 3.5 with the result for the heterozygosity under a star-like 
approximation for the genealogy at the selected site, which was used by Pennings and Hermisson 
(2006b, eq. (8)) and is referred to as (3.11) in the following. 

Note that this formula also arises approximately by taking $p_0^{\lfloor 2\alpha\rfloor}(\gamma, 0)$ instead of $p_1$ in (3.10). 
As shown in Table 2, the additional terms from the Yule process approximation lead to an 
improvement over the simple star-like approximation result. 

4. The quantification of sequence diversity patterns for selective sweeps with recurrent mutation 
using the Yule process approximation is not restricted to heterozygosity. Properties of several 
other statistics could be computed. As an example, we mention the site frequency spectrum, 
which describes the number of singleton, doubleton, tripleton, etc, mutations in the sample. 

Moreover, as pointed out by Pennings and Hermisson (2006b), selective sweeps with recurrent 
mutation also lead to a distinct haplotype pattern around the selected site. Intuitively, every 
beneficial mutant at the selected site brings along its own genetic background leading to 
several extended haplotypes. Quantifying such haplotype patterns would require models 
for more than one neutral locus. 
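
The bracketed percentages in Table 2 are relative errors with respect to the Wright-Fisher simulation. As a small consistency check, they can be recomputed from the tabulated values. The sketch below uses only the $\theta = 0.1$ and $\theta = 1$ blocks of the table; the variable and function names are hypothetical.

```python
# Values copied from Table 2 for rho = 2, 5, 10, 50 (alpha = 1000).
wf_model = {0.1: [0.112, 0.153, 0.223, 0.507], 1.0: [0.524, 0.523, 0.554, 0.723]}
yule_approx = {0.1: [0.116, 0.152, 0.209, 0.541], 1.0: [0.512, 0.529, 0.556, 0.722]}  # row (3.9)
star_approx = {0.1: [0.12, 0.162, 0.228, 0.599], 1.0: [0.516, 0.539, 0.575, 0.779]}   # row (3.11)


def relative_errors(approx, reference):
    """Relative error of each approximation in percent, rounded to integers."""
    return [round(100 * abs(a - r) / r) for a, r in zip(approx, reference)]


for theta in (0.1, 1.0):
    print(theta, relative_errors(yule_approx[theta], wf_model[theta]),
          relative_errors(star_approx[theta], wf_model[theta]))
```

In both blocks the error of (3.9) at $\rho = 50$ is clearly smaller than that of the star-like formula, which is the improvement referred to in item 3.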

                 $\theta = 0, \rho = 2$    $\theta = 0, \rho = 5$    $\theta = 0, \rho = 10$    $\theta = 0, \rho = 50$
WF-model         0.024                     0.058                     0.108                      0.475
(3.9)            0.028 (17%)               0.069 (19%)               0.133 (23%)                0.504 (6%)
(3.11)           0.032 (33%)               0.079 (36%)               0.151 (40%)                0.559 (18%)

                 $\theta = 0.1, \rho = 2$  $\theta = 0.1, \rho = 5$  $\theta = 0.1, \rho = 10$  $\theta = 0.1, \rho = 50$
WF-model         0.112                     0.153                     0.223                      0.507
(3.9)            0.116 (4%)                0.152 (1%)                0.209 (6%)                 0.541 (7%)
(3.11)           0.12 (7%)                 0.162 (6%)                0.228 (2%)                 0.599 (18%)

                 $\theta = 1, \rho = 2$    $\theta = 1, \rho = 5$    $\theta = 1, \rho = 10$    $\theta = 1, \rho = 50$
WF-model         0.524                     0.523                     0.554                      0.723
(3.9)            0.512 (2%)                0.529 (1%)                0.556 (0%)                 0.722 (0%)
(3.11)           0.516 (2%)                0.539 (3%)                0.575 (4%)                 0.779 (8%)

Table 2: Comparison of numerical simulation of a Wright-Fisher model to (3.9) and (3.11). 
Numbers in brackets are the relative error of the approximation. For $\theta = 0$ and $\theta = 1$, the same set of 
simulations as in Figure 2 is used. In particular, $N$ is as in Figure 2 and $\alpha = 1000$. 

Proof of Proposition 3.5. Using Theorem 1 we have to establish that $\mathbb{P}_{\alpha,\theta}[\xi_{\beta_0} = \{\{1,2\}\} \mid \xi_0 = (\{1\},\{2\},\emptyset)]$ 
is approximately given by one minus the right hand side of (3.9). To see this, we compute, 



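accounting for all possibilities when coalescence of two lines can occur, 

\begin{align*}
\mathbb{P}[\Upsilon = \{\{1,2\}\}]
 &= \sum_{i=1}^{\lfloor 2\alpha\rfloor} \frac{2p_i^2}{(i+\theta)(i+1)} \prod_{j=i+1}^{\lfloor 2\alpha\rfloor} \Big( \frac{(j-1)(j+2)}{(j+\theta)(j+1)} + \frac{\theta(j-1)}{(j+\theta)(j+1)} \Big) \\
 &= \sum_{i=1}^{\lfloor 2\alpha\rfloor} \frac{2p_i^2}{(i+\theta)(i+1)} \prod_{j=i+1}^{\lfloor 2\alpha\rfloor} \frac{j-1}{j+1}\cdot\frac{j+2+\theta}{j+\theta} \\
 &= \sum_{i=1}^{\lfloor 2\alpha\rfloor} \frac{2i\,p_i^2}{(i+\theta)(i+1+\theta)(i+2+\theta)} + O\Big(\frac{1}{\alpha}\Big) \\
 &= \sum_{i=1}^{\lfloor 2\alpha\rfloor} \Big( \frac{2}{(i+1+\theta)(i+2+\theta)} - \frac{2\theta}{(i+\theta)(i+1+\theta)(i+2+\theta)} \Big) p_i^2 + O\Big(\frac{1}{\alpha}\Big).
\end{align*}

Rewriting gives 

\begin{align*}
\mathbb{P}[\Upsilon = \{\{1,2\}\}]
 &= \sum_{i=1}^{\lfloor 2\alpha\rfloor} \Big( \frac{2}{i+1+\theta} - \frac{2}{i+2+\theta} \Big) p_i^2
   - \theta \sum_{i=1}^{\lfloor 2\alpha\rfloor} \Big( \frac{1}{(i+\theta)(i+1+\theta)} - \frac{1}{(i+1+\theta)(i+2+\theta)} \Big) p_i^2 + O\Big(\frac{1}{\alpha}\Big) \\
 &= \frac{2p_1^2}{\theta+2} + \sum_{i=1}^{\lfloor 2\alpha\rfloor - 1} \frac{2(p_{i+1}^2 - p_i^2)}{i+2+\theta}
   - \frac{\theta p_1^2}{(\theta+1)(\theta+2)} - \theta \sum_{i=1}^{\lfloor 2\alpha\rfloor - 1} \frac{p_{i+1}^2 - p_i^2}{(i+1+\theta)(i+2+\theta)} + O\Big(\frac{1}{\alpha}\Big) \\
 &= \frac{p_1^2}{\theta+1} + \sum_{i=1}^{\lfloor 2\alpha\rfloor - 1} \frac{2i+\theta+2}{(i+1+\theta)(i+2+\theta)} \big(p_{i+1}^2 - p_i^2\big) + O\Big(\frac{1}{\alpha}\Big) \\
 &= \frac{p_1^2}{\theta+1} + \frac{2\gamma}{\log\alpha} \sum_{i=2}^{\lfloor 2\alpha\rfloor} \frac{2i+\theta}{(i+\theta)^2(i+1+\theta)}\, p_i^2 + O\Big(\frac{1}{(\log\alpha)^2}\Big),
\end{align*}

where the last equality follows from 

\[ p_{i+1}^2 - p_i^2 = p_{i+1}^2 \Big( 1 - \exp\Big( -\frac{2\gamma}{\log\alpha}\,\frac{1}{i+1+\theta} \Big) \Big) = \frac{2\gamma}{\log\alpha}\,\frac{p_{i+1}^2}{i+1+\theta} + O\Big( \frac{1}{(\log\alpha)^2 (i+1+\theta)^2} \Big). \]

□ 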
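
Two algebraic steps of the computation above can be confirmed in exact arithmetic: the telescoping of the product over $j$, and the fact that the coalescence weights $\frac{2i}{(i+\theta)(i+1+\theta)(i+2+\theta)}$ sum to $\frac{1}{\theta+1}$ up to a boundary term. The following sketch is only a sanity check under the formulas as written here; the helper names are hypothetical.

```python
from fractions import Fraction


def no_event_product(i, n, theta):
    """prod_{j=i+1}^{n} (1 - 2 / ((j+theta)(j+1)) - 2*theta / ((j+theta)(j+1)))."""
    prod = Fraction(1)
    for j in range(i + 1, n + 1):
        prod *= 1 - Fraction(2) * (1 + theta) / ((j + theta) * (j + 1))
    return prod


def telescoped_product(i, n, theta):
    """Closed form obtained by telescoping (j-1)/(j+1) and (j+2+theta)/(j+theta)."""
    return (Fraction(i * (i + 1)) * (n + 1 + theta) * (n + 2 + theta)
            / ((i + 1 + theta) * (i + 2 + theta) * n * (n + 1)))


def coalescence_weight(i, theta):
    """Weight of p_i^2 in the third line of the computation."""
    return Fraction(2 * i) / ((i + theta) * (i + 1 + theta) * (i + 2 + theta))


def partial_sum(n, theta):
    """Sum of the coalescence weights for i = 1, ..., n."""
    return sum(coalescence_weight(i, theta) for i in range(1, n + 1))
```

With $p_i \equiv 1$ (no recombination) the partial sums approach $\frac{1}{\theta+1}$, which is the probability that both lines share the same origin.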



4 Proof of Proposition 3.1 (Fixation times) 

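Our calculations are based on the Green function $t(\cdot\,;\cdot)$ for the diffusion $\mathcal{X} = (X_t)_{t\ge 0}$. This 
function satisfies 

\[ \mathbb{E}^p_{\alpha,\theta}\Big[ \int_0^T f(X_t)\,dt \Big] = \int_0^1 t(x;p)\,f(x)\,dx \tag{4.1} \]

and 

\[ \mathbb{E}^p_{\alpha,\theta}\Big[ \int_0^T \int_t^T f(X_t)\,g(X_s)\,ds\,dt \Big] = \int_0^1 \int_0^1 t(x;p)\,t(y;x)\,f(x)\,g(y)\,dy\,dx. \]

Using 

\[ \psi_{\alpha,\theta}(y) := \psi(y) := \exp\Big( -2\int_1^y \Big( \frac{\theta}{2z} + \alpha \Big)\,dz \Big) = \frac{1}{y^\theta}\exp\big(2\alpha(1-y)\big) \tag{4.2} \]

the Green function for $\mathcal{X}$, started in $p$, is given by (compare Ewens (2004), (4.40), (4.41)) 

\[ t_{\alpha,\theta}(x;p) = \frac{2}{x(1-x)} \int_{x\vee p}^1 \Big(\frac{x}{y}\Big)^{\theta} e^{-2\alpha(y-x)}\,dy. \]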



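Since $T^*$ depends only on the path conditioned not to hit 0, we need the Green function of the 
conditioned diffusion. To derive its infinitesimal characteristics, we need the absorption probability, 
i.e., given a current frequency of $p$ of the beneficial allele, its probability of absorption at 1 before 
hitting 0. This probability is given by 

\[ P^1_{\alpha,\theta}(p) = \frac{\int_0^p \psi(y)\,dy}{\int_0^1 \psi(y)\,dy} = \frac{\int_0^p \frac{e^{-2\alpha y}}{y^\theta}\,dy}{\int_0^1 \frac{e^{-2\alpha y}}{y^\theta}\,dy} \]

for $\theta < 1$. For $\theta \ge 1$, we have $P^1_{\alpha,\theta} \equiv 1$, i.e., 0 is an inaccessible boundary. In the case $\theta < 1$, the 
Green function of the conditioned process is, writing $\sigma^2(x) = x(1-x)$, for $p \le x$ (compare Ewens (2004), (4.50)) 

\[ t^*_{\alpha,\theta}(x;p) = \frac{2}{\sigma^2(x)\psi(x)}\, P^1_{\alpha,\theta}(x) \int_x^1 \psi(y)\,dy = \frac{2}{\sigma^2(x)\psi(x)}\, \frac{\int_0^x \psi(y)\,dy \int_x^1 \psi(y)\,dy}{\int_0^1 \psi(y)\,dy} \]

and for $x \le p$ (see Ewens (2004), (4.49)) 

\[ t^*_{\alpha,\theta}(x;p) = \frac{2\,\big(1 - P^1_{\alpha,\theta}(p)\big)\,P^1_{\alpha,\theta}(x)}{\sigma^2(x)\psi(x)\,P^1_{\alpha,\theta}(p)} \int_0^x \psi(y)\,dy
 = \frac{2}{\sigma^2(x)\psi(x)}\, \frac{\int_p^1 \psi(y)\,dy \int_0^x \psi(y)\,dy \int_0^x \psi(y)\,dy}{\int_0^p \psi(y)\,dy \int_0^1 \psi(y)\,dy}. \]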



Before we prove Proposition 13.11 we give some useful estimates. 
Lemma 4.1. 1. For e,K E (0; oo) there exists C eM. such that 



sup 

e<x<l,0<e<K 



9(1 -x) 



< c. 



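2. For $\theta \in [0;1)$, 

\[ \int_0^1 \frac{e^{-2\alpha z}}{z^\theta}\,dz = \frac{1}{(2\alpha)^{1-\theta}}\,\Gamma(1-\theta) + O\big(e^{-2\alpha}\big), \tag{4.4} \]

where $\Gamma(\cdot)$ is the Gamma function. 
3. The bounds 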



1 1-e- 



■2a.x 



— rfx - log 2a + 7e = O - 



1\ 1 

a 
1 



1-x 



1 



1 /x 



l-x\y 



-2a{y-x) 



dxdy ^ O 



hold in the limit of large a, and uniformly on compacta in 9. 
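
The Gamma-function asymptotics in item 2 of the lemma, $\int_0^1 z^{-\theta} e^{-2\alpha z}\,dz = (2\alpha)^{\theta-1}\Gamma(1-\theta) + O(e^{-2\alpha})$, can be checked numerically. The sketch below is an illustration only: it removes the integrable singularity at 0 with the substitution $z = u^{1/(1-\theta)}$ and then applies a midpoint rule. The function names and the step count are assumptions.

```python
import math


def gamma_integral(alpha, theta, steps=200_000):
    """Midpoint approximation of int_0^1 z**(-theta) * exp(-2*alpha*z) dz for theta < 1.

    With z = u**k and k = 1/(1-theta) the integral becomes
    k * int_0^1 exp(-2*alpha*u**k) du, which has a bounded integrand.
    """
    k = 1.0 / (1.0 - theta)
    h = 1.0 / steps
    total = 0.0
    for m in range(steps):
        u = (m + 0.5) * h
        total += math.exp(-2.0 * alpha * u ** k)
    return k * total * h


def gamma_asymptotic(alpha, theta):
    """Leading term (2*alpha)**(theta-1) * Gamma(1-theta)."""
    return (2.0 * alpha) ** (theta - 1.0) * math.gamma(1.0 - theta)
```

Already for moderate $\alpha$ the two values agree to many digits, in line with an exponentially small error term.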
Proof. 1. By a Taylor approximation oi x i-^ x^ around a; = 1 wc obtain 

= i+9{i-x) + '-^e-''a~^f 

for some a; < ^ < 1 and the result follows. 
2. We simply compute 



'dz 



1 





3. For (|4.5p . we write 

nl 



2a 



X 



(2a)i- 



^dz = 



(2a 



^r(l-0) + O(e-2") 







6* 7o 



(4.3) 



(4.4) 



(4.5) 
(4.6) 
(4.7) 
(4.8) 



(4.9) 



i-^ I' e-^^'^dx + oi^^e-'' + 2a j\l - x)e-^"^^-''^dx) 
0(l)+i0(ae-") 
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where we have used 1. for e = i. For (|4.6p . see ( Bronstein . 19821 p. 61). Equation (|4.7p 
foUows from 

7o - Jo l~x j^Jo -2a 

and (|4.8p from 

(_) e-2"(y-^)d2:dy < 2 / / e-^"^y-''^dxdy 

_ 1 -x Vy/ Jo Jo 

2a J Q \a^ 

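Lemma 4.2. Let $2\alpha \ge 1$. There is $C > 0$ such that for all $\theta \in [0;1]$ and $x \in [0;1]$ 

\[ P^1_{\alpha,\theta}(x) \le \big( C (2\alpha x)^{1-\theta} \big) \wedge 1. \]

Proof. By a direct calculation, we find for $\theta < 1$ 

\[ P^1_{\alpha,\theta}(x) = \frac{\int_0^x y^{-\theta} e^{-2\alpha y}\,dy}{\int_0^1 y^{-\theta} e^{-2\alpha y}\,dy} \le \frac{\int_0^x y^{-\theta}\,dy}{e^{-1} \int_0^{1/(2\alpha)} y^{-\theta}\,dy} = e\,(2\alpha x)^{1-\theta}. \]

Moreover, since $P^1_{\alpha,\theta}(x)$ is a probability, the bound $P^1_{\alpha,\theta}(x) \le 1$ is obvious. □ 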



□ 



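Proof of Proposition 3.1. We start with the proof of (3.1), i.e., we set $f = 1$ in (4.1). We split the 
integral of $\mathbb{E}_{\alpha,\theta}[T]$ by using $\frac{1}{x(1-x)} = \frac{1}{x} + \frac{1}{1-x}$, i.e., 

\[ \mathbb{E}_{\alpha,\theta}[T] = 2\int_0^1 \int_0^y \frac{1}{x}\Big(\frac{x}{y}\Big)^{\theta} e^{-2\alpha(y-x)}\,dx\,dy + 2\int_0^1 \int_0^y \frac{1}{1-x}\Big(\frac{x}{y}\Big)^{\theta} e^{-2\alpha(y-x)}\,dx\,dy. \]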



For the first part 



i /';/-i^(l-e-2"(i-))d^ 
a 7o 1 - a; ' 

^ ' ^ (l-e-2"(i--))dx+ / dx)+0(^)+^0{ae-") 





1/1 °° '■^ 



a\9 J r,l~X ./n 1 — X 



n— 1 ^ ^ 

where we have used (4.5) in the fourth and both (4.6) and (4.7) in the fifth equality. The second 
part gives, using (4.8) and (4.3), 



^0 



1 /-^/^ 1 / z 



JQ 



1 — y + x\ y 



1 - - ) e-^°"'dxdy 



^ ry ^ / ry r / i 1 \ \ / 1 



= 2/ / e-^^f^^^'dydx + of / xlogf- ^ ) e-^°'''dx) +o(^) 

Jq Jo 1-x VJo Vl-y + x/ y=x I I 

^'-r'-^-^dx+o(^ r.iog(^)e"^d.)+o(i,) 

a Jq 1 — x \a'^ Jo \x/ / Va^/ 

(4.10) 

and (3.1) follows. For the proof of (3.2) we have 



By using $\frac{1}{x(1-x)} = \frac{1}{x} + \frac{1}{1-x}$ we again split the integral in the numerator. For the $\frac{1}{x}$-part we find 



^m\-^-~dzdydx-^'^2 / / ^ 7^)V-dzdydx 
2 f f /%-2"«(i-^')z-''e-2"^"^d2dydx 



JQ JO 
1 /.I 



1 ^g-2a.(i-.) _ e-2"(i--))z-''e-2--rf^da; 



a Jo Jo 1-2; 
a Jo Jo 

= - /'i^^I^^dx C e-^^'z-'dz + - C /'-e-2"(-+^)(l-e2"-^)z-''da;dz. 
a Jo X Jo a Jo Jo X 

Using (4.4) we see that 

/•I /•! -1 °° /■! /"Set ^n— 1 

Jo Jo X ^ ' ^Jo Jo ri\ 

pi °° n 



1 ^ 



( /" e-2"^z-^log(l-2)dz) 



Wo 

-of C z^-'e-^'^'dz 
1 



(4.12) 



= r(2 - 6')e' 

such that, with (l44l). 



2/or/;^^fe) d.dydx _ 



/o .2-^e-"^d2 a 



= -(log2a + 7e)+0(4). (4.13) 
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For the Yirj-part, we write 

— ) e-^"'dzdydx^ ( / z-''e-2"-dz ) ( / / (-) dydx 



Jx Jo 



1-x \yzJ \Jf) ^^Jo Jx 1-2; Vy 



mi -2a(y~x) , J. -.e 



©( / / / dydxdz 

Jo Jq Jx ^ ^ X 

a 

0{ - / ' z-''e-2"Mog(l - z)dz) 



a 



Oli I z'-'e-^"'dz 



= r(2-0)o(^ 

where we have used (4.4) in the last step. Hence, by (4.10) 
2 



Plugging (4.13) and (4.14) into (4.11) gives (3.2). 

For the variance we start with $\theta < 1$. By (4.2) and a calculation similar to the one in the proof of 
Lemma 4.2, for some finite $C$ (which is independent of $\theta$ and $\alpha$), using (4.2) 

V"[T*] = 2/ / t;(u;;0)t;(x;w)da;dw 
Jo Jo 



ia{w,x,v,z) Jq Jo Jn, Jw ^)x{1- ^)\yz J 

" Jo Jo Ja Jo 2a-u)/V.T 2a -x/ ^ yz ^ 

(4.15) 

where the last equality follows by the symmetry of the integrand with respect to y and z. We 
divide the last integral into several parts. Moreover, we use that 

2a t^z f^y f^w /'2a r-z py /•u'Al /'2a r^z r^y r^w 

/ / / ...dxdwdydz = / / / ...dxdwdydz + / / / ■■■dxdwdydz. 
Jo Jo Jo Jo Jo Jo Jo Jl Jl Jl Jl 

(4.16) 
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First, 

— e'^+^-«-~'(^) (x^-^o Al)dxdwdydz 



Jo Jo Jo ^ 

roo nwAl ^-z ,yjx\^ f°° f pW+x-y-z . 

/ / / / ( ) x'^~'^^ dxdwdydz + / / / / dxdwdydz] 

Jo Jo Jo Jo wx\yzJ wx ) 

( - — III e.~^w\ — J dwdydz + / / / / dzdydwdx] 

^2 -6/ Jo J„ Jo \yzJ J wx I 



= O 



^ / / —a-V^^'^dydz^ \ \ dwdx] 



Jo 

oo 



z" ./i ./„ wa; 



Second, since —rrr — r < —ft- — r for x < w, 

^lOL py nW 



(4.17) 



r r r r^^' —^^-(—v x^-^^ dxdwdydz 

lo Jo Jo Jo a;(2a -w)\yzJ 



O 



2a pz py pw ^w+x~y~z 



1 Jl Jl Jl 



x{2a — w) 



dxdwdydz 



= 0{ / / e^M — ) dwdydz + / / / / —, -dzdydwdx 

lo Jo Jo 2a -w \yzJ J^ J^ J^ Jy x{2a - w) 



~2a — iu 

0( / / y'~''—dydz+ / / ^ ^ 



y"' ^—^dydz+ / / ^ ^dwdx 



= 0(1). 

Third, 



(4.18) 



2a nz ny nW 



e^'+^-^-^f— V(x2-2« A \)dxdwdydz 
\ vz J 



Jo Jo Jo {2a-w){2a-x) \yz 

{,w,x,y.z)^ f /•2" /-oo gy+z-^c-^- 

= 0( / / / / dxdwdydz) 

2a-iw,x,y,z) \Jo J^ Jy J ^, WX I 

p2a px /•w py y-\-z—x—w , 

= 0{ / / / dzdydwdx] (4.19) 

^Jo Jo Jo Jo wx J 

=o(r r^^^^^^^^^dwdx 

^ Jo Jo 2;w 

p2a nxAl —X /•2a nx w — x , 

""^( / / dwdx+ / dwdx)=0{l). 

^Jo Jo Jl Jl / 



Plugging (4.17), (4.18), (4.19) into (4.15) gives (3.3) for $\theta < 1$. 

process            rate at time $t$                              interpretation
$\mathcal{T}_1$    $\frac{1}{X_t}$                               coalescence in B
$\mathcal{T}_2$    $\frac{\theta}{2}\,\frac{1-X_t}{X_t}$         mutation from B to b
$\mathcal{T}_3$    $\rho(1 - X_t)$                               recombination from B to b
$\mathcal{T}_4$    $\rho X_t$                                    recombination from b to B
$\mathcal{T}_5$    $1$                                           coalescence in b
$\mathcal{T}_6$    $\frac{X_t}{1-X_t}$                           coalescence in b

Table 3: Rates of Poisson processes 



For $\theta \ge 1$, we compute 



"'0 J w J w 



[T*] = V„,e[r] = 2 / / / / — -[—) dzdydxdw 



w{l — w)x{l — x)\ yz / 



/■I /-z rV j-w ^-2a(w+x-y-z) 

< 4 / / / / — r— -dxdwdydz 

Jo Jo Jo Jo yil-w)z(l-x) 

{w ,y ,z)~ 



2a i*z i*y p'wAl 



0{ / / / — r— 7- -dxdwdydz 

2a{w,x,y,z) \Jo Jq Jq Jq y{2a - w)z{2a - x) 

2a /•2a i'2a i-oo w+x—y—z 



{2a — w)(2a — x)yz 



dzdydwdx^ 



'1 J X J w J y 

1 /■2a rV g-z yj f^l /•2a /•2a (•2q ^w+x-2y 

^\ / / o dwdydz+ / / / jt, ^ ^dydwdx 

Jq Jq Jq y z 2a - w J^ J^ J.^ (2a - w){2a - x)y^ 

1 r2a pz -z p2a-l p2a-l „x-w i , 

1 /•2q <.2a-l -, 2 1 

Va^ a'' J I \x la — X ) / Va^ 

(4.20) 
□ 



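5 Key Lemmata 

In this section we prove some key facts for the proof of Theorem 1. Recall $\rho = \gamma\frac{\alpha}{\log\alpha}$ from (3.4) and 
let $\mathcal{T}_1, \ldots, \mathcal{T}_6$ be Poisson processes conditioned on $\mathcal{X}$ with rates $\frac{1}{X_t}$, $\frac{\theta}{2}\frac{1-X_t}{X_t}$, $\rho(1 - X_t)$, 
$\rho X_t$, $1$ and $\frac{X_t}{1-X_t}$ at time $t$, as given in Table 3. Moreover, let $T_j := \sup \mathcal{T}_j$ be the last event 
of $\mathcal{T}_j$, $j = 1, \ldots, 4$, in $[0;T]$. 

Note that $\mathcal{T}_1$ gives the pair coalescence rates in B. In addition, coalescences in the wild-type 
background might happen due to events in $\mathcal{T}_5 \cup \mathcal{T}_6$ since $1 + \frac{X_t}{1-X_t} = \frac{1}{1-X_t}$. The other processes 
determine changes in the genetic background due to mutation ($\mathcal{T}_2$) and recombination ($\mathcal{T}_3$, $\mathcal{T}_4$). 

We will prove three Lemmata. The first deals with events of the Poisson processes during 
$[0;T_0]$. Recall that $T_0 > 0$ iff $\theta < 1$. The second lemma is central for (3.6), i.e., to prove that no 
lines are in the beneficial background at time $T_0$. The third Lemma helps to order events during 
$[T_0;T]$. We use the convention that $[s;t] = \emptyset$ for $s > t$. 

Lemma 5.1. Let $\theta < 1$. Then, 

\[ \mathbb{P}^0_{\alpha,\theta}\big[\mathcal{T}_4 \cap [0;T_0] \ne \emptyset\big] = O\Big(\frac{1}{\alpha\log\alpha}\Big), \tag{5.1} \]

\[ \mathbb{P}^0_{\alpha,\theta}\big[\mathcal{T}_6 \cap [0;T_0] \ne \emptyset\big] = O\Big(\frac{1}{\alpha^2}\Big). \tag{5.2} \]

All error terms are in the limit for large $\alpha$, are uniform in $\theta$ and uniform on compacta in $\gamma$. 

Lemma 5.2. For all values of $\theta$ and $\alpha$ and all $\varepsilon > 0$, 

\[ \mathbb{P}^0_{\alpha,\theta}\big[\mathcal{T}_2 \cap [T_0; T_0+\varepsilon) = \emptyset\big] = 0. \]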
Lemma 5.3. The bounds 



1 



(5.1) 
(5.2) 



KA^5r^iTo;T]^ 



O 



o 



(5.3) 
(5.4) 

(5.5) 
(5.6) 

(5.7) 



Remark 5.4. Lemmata 5.1 and 5.3 are crucial in ordering events in $\xi^{\mathcal{X}}$ (recall all rates from 
Table 1). In particular, let us consider events in $[T_0;T]$, i.e., the bounds from Lemma 5.3. The full 
argument for the application of Lemmata 5.1 to 5.3 is given in the proof of Theorem 1 in Section 6. 



(log a) 



o 



f log a 
\ a 



hold in the limit for large a, are uniform on compacta in 9 and 7. 



Consider a single line (i.e. a sample of size 1). Recall from Table 3 that the processes $\mathcal{T}_2$, $\mathcal{T}_3$, $\mathcal{T}_4$ 
determine changes in the genetic background due to mutation ($\mathcal{T}_2$) and recombination ($\mathcal{T}_3$, $\mathcal{T}_4$). 

As we see from (|5.4p . the event that the line (backwards in time) changes background by 
recombination to the wild-type and back to the beneficial background has a probability of order 
^( (logop') ■ event that the line changes genetic background by mutation and recombines back 
to the beneficial background has a probability of order C'(--^) by ()5.3p . The event of a coalescence 
in the wild-type background requires that both lines change background to the wild-type and so, 
necessarily, one event from (|5.5p . (|5.6p or (|5.7p must take place. Hence, the probability of a 



coalescence event in the wild- type background is of the order 0[ 



Proof of Lemma I5.il For (|5.ip , since Pg is monotone increasing in 9 and P^ q (p) 



we 
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compute, using Lemma 14.21 and p ~ ^( io"a )' 



ls[efn[O;ro]^0]=E°,, 



1 — exp ( — 



To 



pXtdt 



< E 



= P {ta,e{x; 0) - i* 0))x(ix 

^0 

< p / (l-P„\(.T))2;ta,e(a;;0)dx 
Jo 



To 



1 g2a(l-a;) _ 



■2a /"See I- 



Jog a 
Vaioi 



1 -X 



-dx 



log a 



For by a similar calculation, 



/■I 2- 

PM[e6n[0;To] ^0] < / {U,e{x;0)-tl^g{x-0))- dx 

Jo L — X 



O 



1 fl g-2ai,^2 - (,-2a{l-x} 



dydx 1 



1 



< 0{ 



a 
1 



(l-a;)2 

1 (g2a(l-.)_^)(i_g-2a(l-.)) 

(l-a;)2 

2" (e^-l)(l-e-^) , 
7^ dx 



dx 



□ 



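Proof of Lemma 5.2. Note that the process $\mathcal{X}$ as well as its time-reversion $\mathcal{Z} = (Z_t)_{t\ge 0}$ with 
$Z_t := X_{T-t}$ are special cases of the diffusion studied in Taylor (2007). We use Lemma 2.1 of that 
paper, which extends Lemma 4.4 of Barton et al. (2004). Their Lemma 2.1 shows that, for all 
$0 < s < T^*$, 

\[ \mathbb{P}_{\alpha,\theta}\Big[ \int_{T^*-s}^{T^*} \frac{1}{Z_t}\,dt = \infty \Big] = 1. \]

In particular, 

\[ \mathbb{P}_{\alpha,\theta}\big[\mathcal{T}_2 \cap [T_0, T_0+s) = \emptyset\big] = \mathbb{E}_{\alpha,\theta}\Big[ \exp\Big( -\int_{T_0}^{T_0+s} \frac{\theta}{2}\,\frac{1-X_t}{X_t}\,dt \Big) \Big] = 0. \]

Hence the result follows for $s \to 0$. 

□ 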



□ 



Proof of Lemma 5.3. Proof of (5.3): Set $\mathcal{Y} = (Y_t)_{0\le t\le T^*}$ with $Y_t = X_{T-t}$, i.e. $\mathcal{Y}$ is the 
time-reversion of $(X_{T_0+t})_{0\le t\le T^*}$. Recall that the Green function of the time-reversed diffusion $\mathcal{Y}$ is 
given for $x \le p$ by (see Ewens (2004), (4.51)) 



t**{x;p) = 2 



a-^{x)iP{x) Jo^{y)dy 
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and for p<x by fsee lEwend (|2004| ). (4.52)) 

1 tp{y)dy i;{y)dy ip{y)dy 



cj^{x)^{x) jli,{y)dyj^^{y)dy 



with the convention that {'i '(1^!'!^ 1 for 6* > 1. Denote by 



T^-* ~ sup{t < T : e} = T - inf{t > : Yt = e}. 



We will use 



KA^^ n [To; T^] ^ 0] < P0,,[ef n [To; T^] ^ 0] + P°,,[rf < T^^] (5.8) 

and bound both terms on the right hand side separately for e ~ e(a) = The bound of the 

first term is established by 



fx 



o 



-2a{x- 



uniformly for e < x < 1 and 



1 — exp ( — p Xtdt 



= o[p 
= o(p 



1 - exp (^^ P J Ytdtj < P J t*a.ei^' s)xdx 



Jx 



^(£)V— ....... /;/-^(£) 



1 /e^ 



dydx 



-2a{y-x} 



1 ry 



dydx + p 



-2a{y- 



1-X 



' dxdy 



©(^^ dx + pj^ log(l- (2/-e))e-2«(«-^)dy) 



while the bound of the second term follows from 



1 — exp 



rpX 2 Xt 



-dt 



<T^-K.e[T*]^0{^). (5.9) 



2e 



/a 



Hence, we have bounded both terms on the right hand side of (|5.8p and thus have proved (|5.3 
Proof of dEll); Note that by g^J 



f (l - exp ( - /" pX.ds) ) p(l - Xt) exp ( - I p(l - X,)d^ 



(1 - Xt) / Xsdsdt 



(5.10) 



/ / ie(2^;0)te(y;2;)a;(l - ?/)fiyda;. 



Jo 

We split the last double integral into parts. First, 



"'0 



t*e (x; OK,e iy;x)xil- y)dydx < Y% [T*] = O (^) 



(5.11) 
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by Proposition p.3p . Second, recall i* g{y, x) = g(y; 0) for x < y. So we have, for all values 
of 61, 

1 /.I pi /.I 

t*a,eix'^0)tl g{y; 0)x{l ~ y)dydx < I / tafi{x]Q)tafi{y\Q)x{l - y)dydx 

»1 /.I 







^Jo A y« ^2^'^ '^-xy I 



f 'J X J X J y 

r2a f>2oL /.2a /.2q 



0(4/ / / / e"(^+^'-^-«) f^V-^-dz'dzdyda; 



Vzz'; 2a-a:y ^"^"^"-^y (5.12) 



Hence, plugging ((5?TT|) and ((5TT2)) into ((5TT0)) establishes ((O)) since p = Cfi^f^ 
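Proof of (5.5): We simply observe, using Proposition 3.1, 

\[ \mathbb{P}^0_{\alpha,\theta}\big[\mathcal{T}_5 \cap [T_0;T] \ne \emptyset\big] = \mathbb{E}^0_{\alpha,\theta}\big[1 - e^{-T^*}\big] \le \mathbb{E}^0_{\alpha,\theta}[T^*] = O\Big(\frac{\log\alpha}{\alpha}\Big). \]

Proof of (5.6): We will use the time-reversed process $\mathcal{Y}$ as in the proof of (5.3). Note that 

\[ \mathbb{P}^0_{\alpha,\theta}\big[\mathcal{T}_6 \cap [T_0;T_2] \ne \emptyset\big] \le \mathbb{P}^0_{\alpha,\theta}\big[\mathcal{T}_6 \cap [T_0;T_\varepsilon] \ne \emptyset\big] + \mathbb{P}^0_{\alpha,\theta}[T_2 > T_\varepsilon] \tag{5.13} \]

and the last term is bounded by (5.9). The first term is bounded 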



by (recall e = e{a) = ^) 

P^^.Ke'' n [To;ff ] ^ 0] < f t*:,[x-e)-^dx 

Jo I — X 



nl .1 .1 1 



= o 



X {l-x)^\yz 
( e-^"^y-^Uydx + ^ e-2"(--)dx) 



Proof of (lEZl); Note that 

pi pi 

[4^ n [To;Ti^] ^ 0] < p / / i;(«;;0)t;(x;«;)-^(l - x)dxdw. 
Jo Jo 1 - 



^"^'"'^ I I [^0,^3 J 7= wj :i ^ 

/q Jo 

We split the last integral and use that t*{x; w) = t*{x; 0) for w < x, such that 



Jw 



C,e(«^; 0)C,e (a;; 0)^^(1 -x)cixdz/;<0(( f C,e(i«; 0)d«;) ') 



= o((E%[r*])2)=o 



(log a) 
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by Proposition 13. II For the second part, using (14.16^ . we have in the case 6 < 1. by a calculation 
similar to (|4.18p . 



LU 

t* g{w; 0)i* g{x; w)- (1 - x)dxdw 

1 ~ w 

^Jo Jo Jw Jw [i"W) X \yzy 

/ / / / ^0 ^(—)i^Air''dxdwdydz 

Jo Jo Jo Jo x{2a~ wY \yz J 



- / / / / ^ TTT — x^-^" dxdwdydz 

a Jo Jo Jo Jo ^(2a - w)^ \ yz / 



^ p2a i'2a p2a (.2q ^w+x—y—z , 

-^dzdydwdx 1 



1 Jx x(2a-w)^ 

2 /■2a /.z /.y g-2y;2 -[^ 

^2a ^2a ^2a ^^i+x-y^^-y _ g-2Q^ 



1 Jt« x{2a-wY 



dydwdx 



f2a i^z i-yAa ^-z^2 r2a /.2q /.2q 

0(^l II -^d^dydz + e-^a^^^^l (S^^^'^^^- 



JO JO 



a Jj. x{2a — w) 

0{ ^ I e-'z^-^'dz 



^ 7^ \2 ^^^^ J 



2^ /.2a 



r.2a-l 2x . , ;l 



— 4a /•2a ^2a—x ^2a~'w+x Q^f^^2'w _|_ 

" A Jo 

e 



(5.14) 

For > 1, we compute, similar to (|4.20p . 

^ '"^ * , r.. * , . W , , rl rv^ rl rl ^2o.{w+x-y-z) e 

g['w;0)t^ g{x;w)- (1 - x)dxdw = / / / / —r. 75 — dzdydxdw 

1 - w Jo Jo Ju, Jw (1 - i«) a; V / 

i-^^r'- o(i r r p-^ii^dxd^dydz 

2a{w,x,y,z) V a Jq Jo Jo Jo {2a~wYyz 

= 0\ — I I I I — — dxdwdydz 

Jo Jo Jo Jo [2a~wyyz 

^ /•2a /•2a /.2a /•2a ^w+x — y—z^ , 

2 — dzdydwdx j 



a Ji Jx Jw Jy (2a - wfyz 

f2a /•z pyAa /•2a /•2a /•2a /•2a ^w~yx~y — z 

I / / e~^dwdydz^ I I I I i -rr— dzdydwdx 

' Jo Jo otJi J^ J^ Jy {2a-wYx 



since the last term in the second to last line equals the term in the fifth line of (5.14) such that 
we are done. □ 

event              rate
coal in B          $\frac{1-X_t}{X_t}$
coal in b          $0$
mut from B to b    $\frac{\theta}{2}\,\frac{1-X_t}{X_t}$
rec from B to b    $\rho(1 - X_t)$
rec from b to B    $0$

Table 4: Transition rates of $\eta^{\mathcal{X}}$ in the interval $[0;\beta_0]$. 

6 Proof of Theorem 1 

Recall the transition rates of the process $\xi^{\mathcal{X}}$ given in Table 1. We prove Theorem 1 in four steps. 
First, we establish that almost surely, all lines in $\xi^{\mathcal{X}}$ are in the wild-type background by time 
$\beta_0$. In Step 2, we give an approximate structured coalescent $\eta^{\mathcal{X}}$, which has different rates before 
and after $\beta_0$. This process already provides us with a good approximation for $\xi_{\ge\beta_0}$. In Step 3, 
we will use a random time-change of the diffusion $\mathcal{X}$ to a supercritical Feller diffusion $\mathcal{Y}$ with 
immigration. In Step 4 we will use facts about the connection of the supercritical branching 
process with immigration to a Yule process with immigration. 

6.1 Step 1: All lines in wild-type background by time $\beta_0$ 

We will show below that all lines in the structured coalescent $\xi^{\mathcal{X}}$ are in the wild-type background 
by time $\beta_0$. 

Proposition 6.1. For all values of $\theta$, $\alpha$, 

\[ \mathbb{P}_{\alpha,\theta}\big[\xi^B_{\beta_0} = \emptyset\big] = 1. \]

Proof. Note that the structured coalescent $\xi^{\mathcal{X}}$ can be constructed using a finite number of processes 
$\mathcal{T}_1, \ldots, \mathcal{T}_6$ (compare Table 3). In particular, the escape of lines in the beneficial background 
to the wild-type background due to mutation is given by the processes $\mathcal{T}_2$. Moreover, we know 
from Lemma 5.2 that any line in the beneficial background by time $T_0 + \varepsilon$ for some $\varepsilon > 0$ will 
experience such an escape since $\mathcal{T}_2 \cap [T_0; T_0+\varepsilon] \ne \emptyset$ almost surely. Hence the assertion follows. □ 

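6.2 Step 2: Approximation of $\xi_{\ge 0}$ by $\eta_{\ge 0}$ 

In order to define the process $\eta^{\mathcal{X}}$ we use transition rates as given in Tables 4 and 5. Moreover, set 

\[ \eta_{\ge 0} := \int \mathbb{P}[d\mathcal{X}]\,\eta^{\mathcal{X}}_{\ge 0}. \]

We will establish that $\xi_{\ge 0}$ and $\eta_{\ge 0}$ are close in variational distance. 
Proposition 6.2. The bound 

\[ d_{TV}(\xi_{\ge 0}, \eta_{\ge 0}) = O\Big(\frac{1}{(\log\alpha)^2}\Big) \]

holds in the limit of large $\alpha$ and uniformly on compacta in $n$, $\gamma$ and $\theta$. 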

Remark 6.3. Note that $(\eta^{\mathcal{X}}_\beta)_{\beta\ge\beta_0}$ does not depend on $\mathcal{X}$ (i.e. $\eta_{\ge\beta_0} = (\eta_\beta)_{\beta\ge\beta_0}$ has the same distribution for 
all realizations of $\mathcal{X}$). Using the same argument as in Step 1 all lines of $\eta_{\beta_0}$ are in the wild-type 
background. These two facts together imply that $\xi_{\ge\beta_0}$ approximately has the same transition 
rates as the finite Kingman coalescent $\mathcal{C}$, which is the statement of (3.8). 

event              rate
coal in B          $0$
coal in b          $1$
mut from B to b    $0$
rec from B to b    $0$
rec from b to B    $0$

Table 5: Transition rates of $\eta^{\mathcal{X}}$ in the interval $[\beta_0;\infty]$. 



Proof of Proposition 1 6. iil Again it is important to note that ^'^ can be constructed using a finite 
number of Poisson processes ^i' , ■ In the same way, rj"^ can be constructed using a finite 
number of Poisson processes Ci^, 'C2', Ca', Cs^ and Poisson processes with rates ■ 

Consider times < /3 < /3o first and recall = sup^,^. A single line may escape the 
beneficial background and recombine back in while this is not possible in 77"^. Such an event 
in requires that either n [To;T^] ^ ill or n [To;T^] ^ for one triple of the processes 
^2 i^i ^ which has a probability of order by (|5.3p and (|5.4p . Hence, ignoring these 

events produces a total variation distance of at most O ^ (^i^^^yi j ■ The coalescence rates in the 

beneficial background of the processes and rj'^ differ by 1. By the bound (|5.5p . the different 
coalescence rates in the beneficial background produce a total variation distance of 0(^^^^). 
Lastly, since ^ ^ ^ i-Xt ' ^® ^^"^ assume that coalescences in the wild- type background 

in ^■^ occur along events of one pair of processes U Such an event requires that either 
i§ n [To; T] 7^ 0, n [To; T^] ^ or n [To; T^] ^ 0. These events together have a probability 
of order 0(--^) by ()5.5p . ()5.6p and ()5.7p and hence, ignoring these events gives a total variation 

distance of order C'(--^). Hence, ^"^ and r/'^ are close for times < /3 < /Jo- 
Let us turn to times $\beta \ge \beta_0$. It is important to notice that, using the same arguments as 
in the proof of Proposition 6.1, $\mathbb{P}_{\alpha,\theta}[\eta^B_{\beta_0} = \emptyset] = 1$. Note that $\eta^{\mathcal{X}}$ differs from $\xi^{\mathcal{X}}$ by ignoring 
back-recombinations along processes $\mathcal{T}_4$ and by changing the coalescence rate in the wild-type 
background from $\frac{1}{1-X_t}$ to 1. Considering a single line, ignoring events in $\mathcal{T}_4$ produces a total 
variation distance of order $O\big(\frac{1}{\alpha\log\alpha}\big)$ by (5.1). Hence, we can assume that all lineages are in the 
wild-type background for $\beta \ge \beta_0$. For coalescences in the wild-type background, we are using that 
$\frac{1}{1-X_t} = 1 + \frac{X_t}{1-X_t}$ such that ignoring events, which occur along one process $\mathcal{T}_6$, produces a 
total variation distance of order $O\big(\frac{1}{\alpha^2}\big)$ by (5.2). 
Putting all arguments together, we have 

\[ d_{TV}(\xi_{\ge 0}, \eta_{\ge 0}) \le d_{TV}\big(\xi_{0\le\beta\le\beta_0}, \eta_{0\le\beta\le\beta_0}\big) + d_{TV}\big(\xi_{\ge\beta_0}, \eta_{\ge\beta_0}\big) = O\Big(\frac{1}{(\log\alpha)^2}\Big). \]

□ 



6.3 Step 3: Random time-change to a supercritical branching process 

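By a random time change, the diffusion (2.1) is taken to a supercritical branching process with 
immigration. Specifically, use the random time change $d\tau = (1 - X_t)\,dt$ to see that the time-changed 
process $\mathcal{Y} = (Y_\tau)_{\tau\ge 0}$ solves 

\[ dY = \Big(\frac{\theta}{2} + \alpha Y\Big)\,d\tau + \sqrt{Y}\,dW, \tag{6.1} \]

stopped when $Y_\tau = 1$, with some Brownian motion $(W_\tau)_{\tau\ge 0}$ (see e.g. Ethier and Kurtz (1986), 
Theorem 6.1.3). Hence, $\mathcal{Y}$ is a supercritical branching process with immigration. Analogous to $T_0$ 
and $T$, define the random times 

\[ \tilde{T}_0 := \sup\{\tau \ge 0 : Y_\tau = 0\}, \qquad \tilde{T} := \inf\{\tau \ge 0 : Y_\tau = 1\} \]

as well as 

\[ \tilde\beta := \tilde{T} - \tau, \qquad \tilde\beta_0 := \tilde{T} - \tilde{T}_0. \]

Conditioned on $\mathcal{Y}$, we define the structured coalescent $\zeta^{\mathcal{Y}} = (\zeta^{\mathcal{Y}}_{\tilde\beta})_{0\le\tilde\beta\le\tilde\beta_0}$ with transition rates 
defined in Table 6. 

event              rate
coal in B          $\frac{1}{Y_\tau}$
coal in b          $0$
mut from B to b    $\frac{\theta}{2}\,\frac{1}{Y_\tau}$
rec from B to b    $\rho$
rec from b to B    $0$

Table 6: Transition rates of $\zeta^{\mathcal{Y}}$ 

Setting 

\[ \zeta_{\tilde\beta_0} := \int \mathbb{P}[d\mathcal{Y}]\,\zeta^{\mathcal{Y}}_{\tilde\beta_0} \]

we immediately obtain the following result. 
Proposition 6.4. For all $\theta$, $\alpha$ and $\gamma$, 

\[ d_{TV}\big(\eta_{\beta_0}, \zeta_{\tilde\beta_0}\big) = 0. \]

Proof. The pairs $(\mathcal{X}, \eta^{\mathcal{X}})$ and $(\mathcal{Y}, \zeta^{\mathcal{Y}})$ can be perfectly coupled by setting $d\tau = (1 - X_t)\,dt$. Under 
this random time change $\beta_0$ becomes $\tilde\beta_0$ and hence, the averaged processes $\eta_{\beta_0}$ and $\zeta_{\tilde\beta_0}$ can also 
be perfectly coupled, leading to a distance of 0 in total variation. 

□ 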
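
The effect of the time change $d\tau = (1 - X_t)\,dt$ on the coefficients can be verified mechanically: every rate per unit of $t$ is divided by $1 - X_t$ to obtain the rate per unit of $\tau$. The sketch below assumes that (2.1) has drift $\frac{\theta}{2}(1-x) + \alpha x(1-x)$ and diffusion coefficient $x(1-x)$, which is our reading consistent with (4.2) and (6.1); the function names are hypothetical.

```python
def drift_x(x, alpha, theta):
    """Assumed drift of the frequency diffusion (2.1)."""
    return 0.5 * theta * (1.0 - x) + alpha * x * (1.0 - x)


def variance_x(x):
    """Assumed infinitesimal variance of (2.1)."""
    return x * (1.0 - x)


def time_changed(rate, x):
    """Rate per unit of tau when d(tau) = (1 - x) dt."""
    return rate / (1.0 - x)


def drift_y(y, alpha, theta):
    """Drift of (6.1)."""
    return 0.5 * theta + alpha * y


def mutation_rate_x(x, theta):
    """Escape rate by mutation per unit of t, as in Table 3."""
    return 0.5 * theta * (1.0 - x) / x


def recombination_rate_x(x, rho):
    """Recombination rate from B to b per unit of t, as in Table 3."""
    return rho * (1.0 - x)
```

After the time change the mutation rate becomes $\frac{\theta}{2Y_\tau}$ and the recombination rate becomes the constant $\rho$, which is what makes the Yule picture of the next step tractable.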



6.4 Step 4: Genealogy of $\mathcal{Y}$ is $\mathcal{F}$ 

Proposition 6.5. Let $\mathcal{Y}$ be a supercritical Feller branching process governed by (6.1) started in 0 
and let $\mathcal{F}^{\mathcal{Y}}$ be the forest of individuals with infinite descent. Then the following statements are 
true: 

1. $\mathcal{F} := \int \mathbb{P}[d\mathcal{Y}]\,\mathcal{F}^{\mathcal{Y}}$ is a Yule tree with birth rate $\alpha$ and immigration rate $\alpha\theta$. 

2. The number of lines in $\mathcal{F}$ extant at time $\tilde{T}$ (when $\mathcal{Y}$ hits 1 for the first time) has a Poisson 
distribution with mean $2\alpha$. 

3. Given $\mathcal{Y}$, the pair coalescence rate of $\mathcal{F}^{\mathcal{Y}}$ is $\frac{1}{Y_\tau}$ and the rate by which migrants occur is 
$\frac{\theta}{2Y_\tau}$. 

Proof. The proposition is analogous to Lemma 4.5 of Etheridge et al. (2006) and can be proved 
along similar lines. We give an alternative proof based on an approximation of $\mathcal{Y}$ by finite models. 
Statement 1. is an extension of Theorem 3.2 of O'Connell (1993). Consider a time-continuous 
supercritical Galton-Watson process $\mathcal{Y}^N = (Y^N_\tau)_{\tau\ge 0}$ with immigration, starting with 0 individuals. 
Each individual branches after an exponential waiting time with rate $N$. (Note that $N$ is a scaling 
parameter and not directly related to the population size.) It splits in two or dies with probabilities 
$\frac{1+s}{2}$ and $\frac{1-s}{2}$, respectively. New lines enter the population at rate $\frac{N\theta}{2}$. Then, $\mathcal{Y}^N/N \Rightarrow \mathcal{Y}$, the 
solution of (6.1), as $N \to \infty$, if $Ns \to \alpha$. Moreover, the probability that an individual of the 
population has an infinite line of descent is $2s + O(s^2)$ for small $s$. As a consequence, the rate 
of immigration of individuals with an infinite line of descent is $\theta\alpha$ in $\mathcal{Y}$. In addition, each such 
line has descendants, which have an infinite line of descent. In particular, each immigrant with 
an infinite line of descent is founder of a Yule tree with branching rate $\alpha$; see O'Connell (1993). 

For 2., consider times $\tau$ when $Y^N_\tau/N = 1$, i.e., $Y^N_\tau = N$ for the first time. Since all lines have 
an infinite number of offspring independently of each other, each with probability $2s + O(s^2)$, 
the total number of lines with infinite descent is binomially distributed with parameters $N$ and 
$2s + O(s^2)$. In the limit $N \to \infty$, this becomes a Poisson number of lines in $\mathcal{F}$ with parameter 
$2\alpha$ at times $\tau$ when $Y_\tau = 1$. 

For 3., let $\mathcal{Y}^N = (Y^N_\tau)_{\tau\ge 0}$ be such that $\mathcal{Y}^N/N \Rightarrow \mathcal{Y}$. Note that by exchangeability the coalescence 
and mutation rates are the same for lines of finite and infinite descent. Since $\mathcal{Y}^N/N$ converges to a 
diffusion process, we can assume that $\sup_{\tau - 1/N \le s \le \tau} |Y^N_s - Y^N_\tau| = O(\sqrt{N})$. Consider the emergence 
of a migrant first and recall that migrants enter the population at rate $\frac{N\theta}{2}$ independent of $\mathcal{Y}^N$. 
Since we pick a specific line among all lines with probability $1/Y^N_\tau$, the rate of immigration 
for times $[\tau - 1/N; \tau]$ is 

\[ \frac{N\theta}{2}\,\frac{1}{Y^N_\tau + O(\sqrt{N})} \xrightarrow{N\to\infty} \frac{\theta}{2Y_\tau}. \]

Next, turn to coalescence of a pair of lines. 
Observe that such events may only occur along birth events forward in time, which occur at rate 
$N Y^N_\tau \frac{1+s}{2}$. Since the probability that a specific pair out of $Y^N_\tau$ lines coalesces is $1/\binom{Y^N_\tau}{2}$, we find that 
the coalescence rate for times $[\tau - 1/N; \tau]$ is 

\[ N\,\frac{1+s}{2}\,\big(Y^N_\tau + O(\sqrt{N})\big)\,\binom{Y^N_\tau + O(\sqrt{N})}{2}^{-1} \xrightarrow{N\to\infty} \frac{1}{Y_\tau}. \]

Hence we are done. 
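
The survival probability $2s + O(s^2)$ used in the proof follows from the extinction equation of the binary branching mechanism: the extinction probability $q$ is the smallest solution of $q = \frac{1-s}{2} + \frac{1+s}{2}q^2$, namely $q = \frac{1-s}{1+s}$. The sketch below, with hypothetical function names, checks this and the resulting mean number $N(1-q) \approx 2\alpha$ of lines with infinite descent when $Ns = \alpha$.

```python
def survival_probability(s):
    """Probability of an infinite line of descent when an individual splits in two
    with probability (1+s)/2 and dies with probability (1-s)/2."""
    extinction = (1.0 - s) / (1.0 + s)  # smallest root of the fixed-point equation
    return 1.0 - extinction


def extinction_by_iteration(s, iterations=500):
    """Fixed-point iteration q <- (1-s)/2 + (1+s)/2 * q**2 started in q = 0."""
    q = 0.0
    for _ in range(iterations):
        q = 0.5 * (1.0 - s) + 0.5 * (1.0 + s) * q * q
    return q


def expected_immortal_lines(n_scale, alpha):
    """Mean number of lines with infinite descent among n_scale lines when N*s = alpha."""
    return n_scale * survival_probability(alpha / n_scale)
```

For example, `expected_immortal_lines(10**6, 1000.0)` is close to $2\alpha = 2000$, the Poisson parameter in statement 2.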
□ 

Proposition 6.6. The bound 

\[ d_{TV}\big(\zeta_{\tilde\beta_0}, \Upsilon\big) = O\Big(\frac{1}{(\log\alpha)^2}\Big) \]

holds for large $\alpha$ and is uniform on compacta in $n$, $\gamma$ and $\theta$. 

Proof. The statement as well as its proof is analogous to Proposition 4.7 in Etheridge et al. (2006). 
By Proposition 6.5 the random partition $\zeta_{\tilde\beta_0}$ arises by picking $n$ lines from the tips of a Yule tree 
with birth rate $\alpha$ with immigration rate $\alpha\theta$ and which has grown to a Poisson($2\alpha$) number of lines, 
and marking all lines at constant rate $\rho$. Hence, the difference of $\zeta_{\tilde\beta_0}$ and $\Upsilon$ arises from 

1. picking from a Yule tree with Poisson($2\alpha$) tips, 
1'. picking from a Yule tree with $\lfloor 2\alpha\rfloor$ tips, 

2. a constant marking rate $\rho$ for all lines, 
2'. a marking probability of $1 - p_{i_1}^{i_2}(\gamma,\theta)$ for a branch, which starts at Yule-time $i_1$ and ends at 
Yule-time $i_2$. 

Both differences only have an effect if they lead to different marks of the Yule tree with immigration. 
To bound the probability of the difference of 1. and 1'., note that the Poisson distribution has a 
variance of $2\alpha$ and hence, typical deviations are of the order $\sqrt{\alpha}$. Given such a typical deviation 
of the Poisson from its mean, the probability of a different marking of both Yule trees is of the 
order 0( ^1*^^^ ), as shown below (4.9) in Etheridge et al. ( 20061 ). For the different marks from 



2. and 2'. note first that the probability that two marks occur within any Yule-time is, since the 
marks and splits of the Yule tree have competing exponential distributions, bounded by 

\[ \sum_{i=1}^{\lfloor 2\alpha\rfloor} \Big( \frac{\rho}{\alpha(i+\theta)+\rho} \Big)^2 \le \frac{\gamma^2}{(\log\alpha)^2} \sum_{i=1}^{\infty} \frac{1}{i^2} = O\Big(\frac{1}{(\log\alpha)^2}\Big). \]

Hence, treating these double hits of Yule times differently only leads to a total variation distance 
of $O\big(\frac{1}{(\log\alpha)^2}\big)$. In particular, we may mark all lines of the Yule tree independently (as in $\Upsilon$) 
since dependence of marks only arises by double hits of Yule times. The probability that a line 
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that starts in Yule time ii and ends in Yulc-time 12 is not marked, is, again using competing 
exponentials, 

(^U + S) _ TT f f ] \ ^ c( ^ 

+ + 7/loga/ 

'12 

exp f 



}^^^a{]+e)+ p ^. j^-l;^ V '''' V J +0 + 7/ log a y ' j2^V(loga) 

if-^ y 1 ) I gf ) 

V log a j + 6* + 7/ logo;/ V(loga)2/ 



.(loga)2 

Hence, the difference of 2. and 2'. accounts for a total variation distance of order $O\big(\frac{1}{(\log\alpha)^2}\big)$ and 
we are done. 

□ 
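
The two estimates used for the marks can be illustrated numerically: the bound on double marks within a Yule-time, and the comparison between the exact competing-exponentials product and its exponential approximation. The sketch below assumes $\rho = \gamma\alpha/\log\alpha$ and that a branch from Yule-time $i_1$ to $i_2$ is exposed to marks during Yule-times $i_1+1, \ldots, i_2$; this index convention and the function names are assumptions of the illustration.

```python
import math


def rho_from_gamma(alpha, gamma):
    """Recombination rate rho = gamma * alpha / log(alpha)."""
    return gamma * alpha / math.log(alpha)


def double_mark_sum(alpha, gamma, theta):
    """sum_{i=1}^{floor(2 alpha)} (rho / (alpha (i + theta) + rho))**2."""
    rho = rho_from_gamma(alpha, gamma)
    n = math.floor(2 * alpha)
    return sum((rho / (alpha * (i + theta) + rho)) ** 2 for i in range(1, n + 1))


def double_mark_bound(alpha, gamma):
    """(gamma / log alpha)**2 * sum_{i>=1} 1/i**2, with the sum equal to pi**2 / 6."""
    return (gamma / math.log(alpha)) ** 2 * math.pi ** 2 / 6.0


def unmarked_exact(i1, i2, alpha, gamma, theta):
    """Product of alpha (i + theta) / (alpha (i + theta) + rho) over i = i1+1, ..., i2."""
    rho = rho_from_gamma(alpha, gamma)
    prob = 1.0
    for i in range(i1 + 1, i2 + 1):
        prob *= alpha * (i + theta) / (alpha * (i + theta) + rho)
    return prob


def unmarked_approx(i1, i2, alpha, gamma, theta):
    """exp(-(gamma / log alpha) * sum_{i=i1+1}^{i2} 1 / (i + theta))."""
    total = sum(1.0 / (i + theta) for i in range(i1 + 1, i2 + 1))
    return math.exp(-gamma / math.log(alpha) * total)
```

The difference between `unmarked_exact` and `unmarked_approx` stays below half of `double_mark_bound`, i.e. it is of order $1/(\log\alpha)^2$ uniformly in $i_1$ and $i_2$.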



6.5 Conclusion 

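Using Propositions 6.1 to 6.6 we can now prove Theorem 1. Note that (3.6) is the same statement 
as given in Proposition 6.1. Since $\xi^B_{\beta_0} = \emptyset$ almost surely, all ancestral lines of $\xi_{\beta_0}$ must be in the 
wild-type background and so, using Propositions 6.2, 6.4 and 6.6, 

\[ d_{TV}\big(\xi_{\beta_0}, \Upsilon\big) \le d_{TV}\big(\xi_{\beta_0}, \eta_{\beta_0}\big) + d_{TV}\big(\eta_{\beta_0}, \zeta_{\tilde\beta_0}\big) + d_{TV}\big(\zeta_{\tilde\beta_0}, \Upsilon\big) = O\Big(\frac{1}{(\log\alpha)^2}\Big). \]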

For the approximation of C>/3(, by the finite Kingman coalescent C we will use Proposition 16.21 
First, note that by the same reasoning as in the proof of Proposition l6.H IP['7^^ 7^ 0] = 0. Moreover, 
^>po ^ (^)t>o requires a back-recombination event with rate pXt for some time < i < Tq and 
thus, using (|5.1|) . 




Let C be a finite Kingman coalescent that starts with a random number of lines and which is 
distributed like ^J^^. Then, since dTv{S,>p„,C') < drv {^>o , V>o) , 

dTv{e>,,,Toc) < dTvief>,,r)+dTvie>i,,,c')^o[j^^). 

6.6 Sampling at time t < T 

Assume $t < T$ is such that $X_t = 1 - \delta/\log\alpha$ for some $\delta > 0$. To approximate the number 
of recombination events in $[t;T]$, we can use the time-rescaling to the process $\mathcal{Y}$ from (6.1) and 
Proposition 6.5 to note that the Yule process has a Poisson number with parameter $2\alpha(1 - \delta/\log\alpha)$ of 
lines at the time the supercritical branching process has $Y_\tau = 1 - \delta/\log\alpha$. Since recombination 
events fall on the Yule tree at constant rate $\rho$, the probability of such an event during $[t;T]$ is 

\[ \sum_{i=\lfloor 2\alpha(1-\delta/\log\alpha)\rfloor}^{\lfloor 2\alpha\rfloor} \frac{\rho}{\alpha(i+\theta)+\rho} = O\Big(\frac{1}{(\log\alpha)^2}\Big). \]

A similar calculation shows that there are no coalescence events in a sample from the Yule tree 
between Yule times $\lfloor 2\alpha(1 - \delta/\log\alpha)\rfloor$ and $\lfloor 2\alpha\rfloor$ with high probability. 
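
The order of the late-phase recombination probability can be made concrete. Summing the per-Yule-time marking probabilities $\frac{\rho}{\alpha(i+\theta)+\rho}$ over $i$ from $\lfloor 2\alpha(1-\delta/\log\alpha)\rfloor$ to $\lfloor 2\alpha\rfloor$ gives, to leading order, $\gamma\delta/(\log\alpha)^2$. The sketch below treats this sum as an upper bound on the probability of at least one mark and assumes $\rho = \gamma\alpha/\log\alpha$; the function names are hypothetical.

```python
import math


def late_recombination_sum(alpha, gamma, theta, delta):
    """Sum of rho / (alpha (i + theta) + rho) over the last Yule-times before fixation."""
    rho = gamma * alpha / math.log(alpha)
    lower = math.floor(2 * alpha * (1.0 - delta / math.log(alpha)))
    upper = math.floor(2 * alpha)
    return sum(rho / (alpha * (i + theta) + rho) for i in range(lower, upper + 1))


def leading_order(alpha, gamma, delta):
    """Heuristic leading term gamma * delta / (log alpha)**2."""
    return gamma * delta / math.log(alpha) ** 2
```

The ratio of the two quantities approaches 1 slowly as $\alpha$ grows, since the neglected terms are of relative order $\delta/\log\alpha$.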